ABSTRACT 

An investigation was conducted to determine the 
effects of response-category weighting and item weighting on 
reliability and predictive validity. Response-category weighting 
refers to scoring in which, for each category (including omit and 
"not read"), a weight is assigned that is proportional to the mean 
criterion score of examinees selecting that category. Item weighting 
refers to the application of multiple regression techniques to 
maximize the relationship between a composite of item scores and a 
criterion. The study of the effects of weighting on reliability 
indicated that scores resulting from response-category weighting were 
significantly more reliable than scores corrected for chance success. 
Response-category weighting in concert with item weighting resulted 
in scores significantly less reliable than scores corrected for 
chance success. The study of the effects of weighting on 
predictive validity indicated that no gain in predictive validity 
accrued through the use of response-category weighting as opposed to 
scores corrected for chance success. Response-category weighting with 
item weighting resulted in scores significantly more reliable than 
scores corrected for chance success. (Author/CK) 
ABSTRACT 



The primary objective of the investigation was to determine 
the effects of response-category weighting and item weighting on 
reliability and predictive validity. Response-category weighting 
refers to scoring in which, for each category (including omit and 
"not read"), a weight is assigned that is proportional to the mean 
criterion score of examinees selecting that category. Item weighting 
refers to the application of multiple regression techniques to max- 
imize the relationship between a composite of item scores and a 
criterion. 

The study of the effects of weighting on reliability indicated 
that scores resulting from response-category weighting were sig- 
nificantly more reliable than scores corrected for chance success. 
Response-category weighting in concert with item weighting resulted 
in scores significantly less reliable than scores corrected for 
chance success. 

The study of the effects of weighting on predictive validity 
indicated that no gain in predictive validity accrued through the 
use of response-category weighting as opposed to scores corrected 
for chance success. Response-category weighting with item weighting 
resulted in scores significantly more reliable than scores corrected 
for chance success. Further research is necessary to refine the 
application of response-category and item weighting to clarify 
interpretation of obtained weights. 
CHAPTER I 
PROBLEMS AND OBJECTIVES 



When the reliability and predictive validity of a test are 
considered, the effects of examinee motivation, administrative 
circumstances, and scoring procedures are often neglected when, in fact, 
they should not be. The investigator generally wants to determine as 
reliably as possible the rank ordering of a group of examinees on the 
composite of traits measured by a defined criterion variable. If the 
investigator is dissatisfied with the test's reliability or predictive 
validity, or both, several alternatives for improving these 
characteristics present themselves. Among other strategies, he may replace or 
revise some of the test items, he may improve the criterion measure 
with which the test scores are correlated, or he may score the test in 
a different manner. If the investigator already has a test made up of 
satisfactory items and a set of criterion scores that are both reliable 
and unbiased, he may still rescore the test with the hope of improving 
its efficiency. One scoring procedure that may be employed uses 
differential choice weights. The problems of differential weighting 
of only the correct responses in test items and of all choices are 
usually considered separately. The weighting of these two entities 
usually can be classified into variable-weighting and fixed-weighting 
methods. 

In variable-weighting methods there is no weight, constant over 
subjects, applied to a single item or item choice. In these methods 
each examinee provides subjective probability estimates of how 
confident he is in making a choice. For example, DeFinetti (1965) proposed 
that an examinee's store of "partial information" be estimated in 
terms of a subjective probability made by the examinee to indicate 
the likelihood that a choice that he has marked as correct is, in fact, 
correct. Scoring items on this basis may, however, introduce the 
dimension of willingness to gamble on the part of the examinee 
(Swineford, 1941). After being trained in the test-taking procedures, 
the examinee realizes that he can get more credit for marking an item 
correctly by indicating that he is sure of the correctness of his 
action than by indicating some lack of confidence in his decision. 
This procedure introduces an unintended variable into the scores so 
that the test may no longer measure the trait that it was designed to 
measure. Other limitations or shortcomings of these methods include 
the need for multiple responses per item, multiple scoring of answer 
sheets, and the examinee's difficulty in understanding how to take the 
test. 

Fixed weights, usually derived by multiple-regression procedures, 
refer to weights for application to all item choices. These are 
identical for all choices in a given item and are constant for all examinees. 
Some research workers have suggested that fixed-weighting procedures 
have maximum value when only small numbers of items are to be weighted. 
Fixed weights for each item choice are most commonly used when there 
is no correct choice; e.g., in personality and interest inventories. 
For each choice, a fixed weight is generally derived on the basis of 
the correlation between marking or not marking that choice and some 
criterion variable; e.g., performance on a job, or membership in one 
of several defined groups. 

Although differential weighting of test items, item choices, 
or some combination thereof should, in theory, provide gains in test 
reliability and predictive validity, in practice only small gains 
generally result. It is this result that has led some psychometricians 
to conclude that differential weighting is not worthwhile (e.g., 
Guilford, 1954; Gulliksen, 1950). On the other hand, some 
investigators (Davis, 1959; Hendrickson, 1971; Reilly & Jackson, 1972) have 
reported significantly improved reliability coefficients by using 
weights for each choice in every item. 

The objective of the present study is to compare the 
reliability and predictive validity of test scores when the scoring 
procedure is based on: 

1. a priori weights of 1 for each correct response and 0 for 
each incorrect response or omission; 

2. a priori weights of 1 for each correct response, -1/(k-1) 
for each incorrect response, and 0 for omission. This is 
the conventional procedure for correcting for chance 
success; 

3. cross-validated weights for every item response-category; 

4. cross-validated weights for every item response-category 
after the weights have been adjusted by means of cross- 
validated partial regression coefficients for predicting 
a defined criterion. 
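
The first two scoring procedures in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: k is taken to be the number of choices per item, omissions are represented by None, and the function names and sample data are hypothetical rather than drawn from the study.

```python
def number_right(responses, key):
    """Procedure 1: 1 for each correct response, 0 for incorrect or omitted."""
    return sum(1 for r, k in zip(responses, key) if r == k)


def corrected_for_chance(responses, key, k_choices):
    """Procedure 2: 1 per correct, -1/(k-1) per incorrect, 0 per omission.

    An omission is assumed to be coded as None.
    """
    score = 0.0
    for r, k in zip(responses, key):
        if r is None:
            continue  # omitted item contributes 0
        score += 1.0 if r == k else -1.0 / (k_choices - 1)
    return score


# Hypothetical 5-item, 5-choice test: three right, one wrong, one omitted.
KEY = ["A", "C", "B", "D", "E"]
RESPONSES = ["A", "C", "D", None, "E"]

print(number_right(RESPONSES, KEY))             # 3
print(corrected_for_chance(RESPONSES, KEY, 5))  # 2.75
```

Procedures 3 and 4 require weights estimated from criterion data and are therefore not covered by this sketch.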



Weighting Item Scores 

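The reliability coefficient of a test, t, when all variables 
are expressed in standard-score form, may be written as: 

$$ r_{tt} = \frac{\sum_i w_i^2 r_{ii} + \sum_i \sum_{j \neq i} w_i w_j r_{ij}}{\sum_i w_i^2 + \sum_i \sum_{j \neq i} w_i w_j r_{ij}} $$

where $w_i$ is the weight of item i, $r_{ii}$ its reliability 
coefficient, and $r_{ij}$ the correlation between items i and j. 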

Weighting the items of a test may affect the sample test's 
reliability coefficient to the extent that the more reliable items are 
weighted more heavily than the less reliable items. Kelley has shown 
(1947, pp. 423-424) that $r_{tt}$ is maximized if the item scores 
are weighted by the inverse of their variance errors of measurement. 
For an item, i, the weight $w_i$ may be written as: 

$$ w_i = \frac{1}{\sigma_i^2 (1 - r_{ii})} $$


In practice, as a single dichotomously scored item varies from 
the 50-per-cent difficulty level in a sample, its variance and its 
reliability coefficient decrease, thus keeping its variance error of 
measurement fairly constant in value until the item approaches 0 or 100 per 
cent in difficulty. At either of these limiting values the item no 
longer differentiates among examinees in the sample tested; it is not 
differentiating one examinee from another and has a variance and a 
variance error of measurement of zero. As a consequence of the fact 
that the weights for items that are capable of maximizing the 
reliability coefficient of the test tend to remain the same for most items 
of the usual difficulty levels, it makes little difference with 
respect to test reliability whether the optimal weights are used or 
are not used. A number of empirical studies have confirmed the 
conclusions of the analytic formulation of the problem given above. 
These studies are summarized in the chapter that presents a review 
of the literature. 

In the general case, the correlation of a weighted sum (ws) 
with an independent variable (c) is: 

$$ R_{(c)(ws)} = \frac{\sum_i r_{ci} w_i \sigma_i}{\sqrt{\sum_i w_i^2 \sigma_i^2 + \sum_i \sum_{j \neq i} w_i w_j \sigma_i \sigma_j r_{ij}}} $$

where $\sigma_i$ is the standard deviation of item i, $r_{ci}$ its 
correlation with the criterion, and $r_{ij}$ the correlation between 
items i and j. 

More specifically, this equation can be considered to yield a 
predictive validity coefficient of a test composed of i items with some 
criterion c. To maximize the relationship $R_{(c)(ws)}$ (the validity 
coefficient), the proper weights ($w_i$) are the multiple regression 
coefficients (beta weights) ($\beta_1, \ldots, \beta_i$) for each item in the test 
being weighted. The extent to which the multiple-correlation 
coefficient will exceed the zero-order correlation of the unweighted sum of 
the test items with the criterion (after cross-validation) depends 
largely on the degree to which the items differ with respect to their 
correlations with the criterion variable and with each other. If the 
items in the test are homogeneous in content, the use of multiple- 
regression weights is not likely to result in an appreciable gain in 
test validity. On the other hand, if the test items are heterogeneous 
(as they are in some cases because they are components of a test that 
properly measures a complex function), the multiple correlation 
coefficient might be considerably higher than the zero-order coefficient 
$R_{(c)(s)}$. Empirical studies bearing on this point are discussed in the 
chapter that presents a review of the literature. 
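
As a rough numerical check on the relationship described above, the sketch below computes the correlation of a weighted item sum with a criterion in two ways: from item standard deviations, item-criterion correlations, and item intercorrelations, and directly from the weighted composite scores. The data, weights, and function names are hypothetical, and the weights are arbitrary rather than regression (beta) weights.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sd(x):
    """Standard deviation with divisor n."""
    m = sum(x) / len(x)
    return math.sqrt(sum((a - m) ** 2 for a in x) / len(x))


def validity_from_formula(items, criterion, weights):
    """Sum of r_ci * w_i * s_i over the standard deviation of the weighted sum."""
    n = len(items)
    s = [sd(col) for col in items]
    numerator = sum(pearson(criterion, items[i]) * weights[i] * s[i] for i in range(n))
    variance = 0.0
    for i in range(n):
        for j in range(n):
            r_ij = 1.0 if i == j else pearson(items[i], items[j])
            variance += weights[i] * weights[j] * s[i] * s[j] * r_ij
    return numerator / math.sqrt(variance)


def validity_direct(items, criterion, weights):
    """Correlate the weighted composite itself with the criterion."""
    composite = [sum(w * col[p] for w, col in zip(weights, items))
                 for p in range(len(criterion))]
    return pearson(composite, criterion)


# Hypothetical data: three dichotomous items (one list per item), eight examinees.
ITEMS = [
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
]
CRITERION = [5, 3, 4, 6, 1, 6, 2, 4]
WEIGHTS = [2.0, 1.0, 0.5]

print(validity_from_formula(ITEMS, CRITERION, WEIGHTS))
print(validity_direct(ITEMS, CRITERION, WEIGHTS))
```

The two printed values agree up to floating-point error, which is the point of the sketch: the composite's validity is fully determined by the weights and the item statistics.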



Weighting Item Response Categories 

If differential weights are assigned to each response category 
in a multiple-choice item, the number of score categories may be 
increased beyond the dichotomy of "passing" or "failing" the item. 
For example, with 5-choice items in which each choice has a different 
weight, an examinee may receive any one of five different item scores 
by marking one of the five choices. However, two other response cate- 
gories are available to him; he may read the item and choose to refrain 
from marking an answer to it or he may work at a rate slow enough so 
that he does not have time to read a given item in the time limit. 
Since scoring weights can be assigned to these response categories, an 
examinee may obtain any one of seven scores for a 5-choice item. 

Guttman showed (1941) that the correlation ratio between a set 
of scores on one item (when these scores take the form of numerical 
values assigned to the item response categories) and a set of criterion 
scores can be maximized by assigning to each item response category a 
value proportional to the mean criterion score of the examinees who 
fall in that category. This general least-squares mathematical model 
for obtaining weights that maximize internal consistency falls under 
the general heading of scaling. Torgerson (1958) has provided a 
comprehensive review of these techniques, including Guttman's method, 
which he categorizes as a method of scaling principal components. 
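
The weighting rule just described, a value for each response category proportional to the mean criterion score of the examinees in that category, can be sketched as follows. This is a minimal illustration with hypothetical names and data; it uses the category mean itself as the weight (a proportionality constant of 1) and omits the cross-validation that the present study applies.

```python
from collections import defaultdict


def category_weights(responses, criterion):
    """Return {category: mean criterion score of examinees in that category}.

    `responses` holds one item's response category per examinee, including
    the special categories "omit" and "not read".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, score in zip(responses, criterion):
        totals[category] += score
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


def weighted_score(examinee_responses, weights_per_item):
    """Sum the category weights for one examinee across items.

    Raises KeyError if a category was never observed when the weights
    were derived; handling that case is left out of this sketch.
    """
    return sum(weights[r] for r, weights in zip(examinee_responses, weights_per_item))


# Hypothetical single item answered by eight examinees.
RESPONSES = ["A", "B", "A", "omit", "C", "not read", "B", "A"]
CRITERION = [60, 40, 70, 50, 30, 20, 50, 80]

ITEM_WEIGHTS = category_weights(RESPONSES, CRITERION)
print(ITEM_WEIGHTS["A"])  # 70.0
print(ITEM_WEIGHTS["B"])  # 45.0
```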

In the present study, Guttman's procedure has been generalized 
from its application to questionnaires to obtaining weights for all 
response categories available to examinees who take aptitude and 
achievement tests. This involves having a scoring weight for each 
choice in a multiple-choice item, a scoring weight for reading each 
item and refraining from marking an answer to it, and a scoring weight 
for not reading the item during the time limit. By including the last 
two response categories, the scoring system is able to take partially 
into account such components as personality factors, test-taking 
strategies, and rate-of-work determinants. 

Guttman (1941) outlined an analytical procedure for obtaining 
the "best" set of numerical weights for each choice in a series of 
multiple-choice items in the sense that the choice weights would yield 
the maximum correlation ratio between the sum of weighted item scores 
and the criterion variable. 

The main consideration of the present investigation is the 
application of Guttman's scaling method to multiple-choice items that, 
unlike the items considered by Guttman, have a keyed "correct" answer 
or response. The effects of this scaling procedure, applied to aptitude 
or achievement-test items, can be viewed in terms of the changes in the 
test's reliability and predictive validity. 

Concern over the question of the information carried in the 
choice among wrong responses in a given test item is evidenced in the 
literature. In a paper by Powell (1968), the question of the 
functional role of wrong answers in multiple-choice tests was the main 
concern. Powell was particularly interested in the amount of potentially 
useful information that is lost when all distracters of an item are 
considered in the general category of "wrong responses." Powell, like 
Davis (1959), observed "...much time is spent in the preparation of 
foils for multiple-choice tests. And a proportionally large amount of 
time is spent by the examinee in making his selection decisions among the 
alternatives (p. 403)." From these observations, Powell conjectured 
that the "wrong" answers may indeed have as much discriminating power 
as the "right" answers. 

The present study employs an item response-category weighting 
method that is a modification of the method originally proposed by 
Guttman (1941) and is concerned with the effects of item response- 
category weighting on the reliability and predictive validity of 
reading tests that measure largely verbal aptitude. The value of 
the response-category weighting methods described herein is judged 
in terms of practical as well as statistical significance. 



CHAPTER II 



REVIEW OF THE LITERATURE 



Literature pertaining to two applications of fixed weighting 
procedures is presented in this review. The first application deals 
with uniform weighting of test items by applying the same weight to 
all response categories for the item. The second deals with the differ- 
ential weighting of response categories for an item. 



Weighting Test Items 

In general, when a uniform weight is applied to all response 
categories in an item, the items themselves are usually scored in a 
conventional manner. That is, the items are usually scored "pass" or 
"fail," with a score of 1 being applied in the former case and a score 
of 0 being applied in the latter, or by the application of the 
correction-for-guessing formula, a 1 being assigned to a correct choice, a negative 
score -1/(k-1) being applied to an incorrect choice, and a 0 being 
applied to an omitted item. Ordinarily the total test score for an 
examinee is obtained by summing the item scores over all items in the 
test. 

The numerous empirical studies reporting the use of uniform 
weighting of all response categories in an item provide fairly over- 
whelming evidence that it is not effective in increasing the relia- 
bility of a test. From formulas presented by Wilks (1938) and Gulliksen 
(1950) on the correlation of weighted sums it is generally agreed that 
when the number of predictor variables (items) is large and only posi- 
tive weights are used, the effects of any weighting system are limited. 
Even when random sets of positive weights are used the resulting corre- 
lations between weighted and unweighted scores are high. However, as 
Stanley and Wang (1968) point out, uniform weighting of item response 
categories may still be useful for increasing predictive validity. 
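
The claim that even random positive weights leave weighted and unweighted scores highly correlated can be illustrated with a small simulation. The sketch below is not a reproduction of any study cited here; the latent-ability model, the sample sizes, the weight range of 1 to 5, and all names are assumptions chosen only to produce positively intercorrelated items.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weighted_vs_unweighted(n_examinees=500, n_items=50, seed=0):
    """Correlation between randomly weighted and unweighted total scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_examinees):
        # A shared latent ability makes the item scores positively intercorrelated.
        ability = rng.gauss(0.0, 1.0)
        p_pass = 1.0 / (1.0 + math.exp(-ability))
        scores.append([1 if rng.random() < p_pass else 0 for _ in range(n_items)])
    # Random positive item weights (assumed range 1 to 5).
    weights = [rng.uniform(1.0, 5.0) for _ in range(n_items)]
    unweighted = [sum(row) for row in scores]
    weighted = [sum(w * x for w, x in zip(weights, row)) for row in scores]
    return pearson(weighted, unweighted)


print(weighted_vs_unweighted())
```

Under these assumptions the printed correlation is close to 1, in line with the analytic argument that weighting matters little when many positively correlated items receive positive weights.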

Douglass and Spencer (1923) investigated the utility of weighting 
the exercises or items in objective tests. They obtained correlations 
of .98 to .99 between weighted and unweighted scores on four parts of an 
algebra test given to 25 secondary-school students. They found analogous 
correlations for the Henmon Latin Test (r = .98) and for the Gregory Test 
of Languages (r = .99). All three examples involved the scoring of the 
same test items in two different ways. The fact that spuriously high 
correlations might be obtained as the result of correlating errors of 
measurement was apparently not considered. Although no conclusions were 
drawn or recommendations made, they did note that the results were in 
accord with earlier work by Charters (1920). Douglass and Spencer 
stated that the weighting procedure was time-consuming and tedious and 
increased the possibility of error in test scoring. 

Holzinger (1923) found similarly high correlations between 
weighted and unweighted scores. On a 40-item test of French grammar, a 
correlation of .99 between weighted and unweighted scores was obtained. 
Similar results were obtained for an algebra test and an arithmetic 
test. 

West (1924) reported the results of a fairly thorough investi- 
gation of the effects of weighting test items on three different tests. 
In each case the weighting method was the same. Weights for items 
were a function of the proportion of examinees who incorrectly answered 
each item. The first study by West compared weighted and unweighted 
scores on each of five parts plus the total score on two forms of a 
reading-comprehension test. Only one of the twelve correlations obtained 
was below .99. The Army Alpha Test (Form 8) was administered to the 
same group of 45 secondary-school students and the effects of item 
weighting on six of the eight parts were studied. Correlation coeffi- 
cients between raw and weighted scores ranged from .940 to .984. West 
noted that the intercorrelations of the part scores were similar for 
both types of scoring. 

A third test, a collection of 200 analogies, was divided into 
five measures of 40 analogies each. The tests were designed so that the 
accumulated scale values for each test would be the same. Each test was 
administered to the same group of 45 secondary-school students used 
earlier. Scoring of each test was done in three ways. An unweighted 
(raw) score, a Pintner Scale score, and a weighted score were obtained 
for each test. Intercorrelations of the five tests were computed for 
each scoring procedure. Correlations between each of the 10 pairs of 
tests scored in each of three ways were computed. The 30 correlation 
coefficients varied from one scoring method to the other. In fact, 
West noted that the rank orderings of subjects based on each of the 
methods were markedly similar. 

West concluded that weighting of test items was generally not 
valuable for purposes of more accurately differentiating the measured 
abilities of examinees. He did, however, note that some value might be 
had in weighting items for purposes of scaling and arranging items in a 
test and then scoring the items in the conventional "raw-score" manner. 

Peatman (1930) attempted to determine the value of Clark's Index 
of Validity as a weight for true-false test items used in determining a 
subject's relative standing or grade. Data were obtained for 73 college 
students on six 25-item true-false quizzes and a final 100-item 
true-false examination. For the six quizzes the correlations between weighted 
and unweighted scores ranged from .879 to .970. The same correlation for 
the longer final examination was .955. A "combined score," an average of 
all quizzes and the final examination, both weighted and unweighted, 
yielded a correlation of .978. Peatman concluded on the basis of these 
findings that weighting of true-false items by the method used was not 
justified. The high correspondence between the original and weighted 
scores resulted in few changes in the relative standing of subjects 
whose grades were determined by these methods. 

Corey (1930) had six psychology instructors evaluate 73 items 
from a psychology examination with respect to each statement's 
importance for a general knowledge of psychology. Correlations between each 
instructor's weighted scores and the raw scores, using 100 randomly 
selected test papers, were obtained. These correlations ranged from 
.82 to .96. They were interpreted to indicate that weights assigned by 
all instructors, save one, noticeably affected the relative standings 
of the students. It was also found that, in the case of one instructor, 
49 per cent of the test papers would have been assigned grades at 
variance with those assigned using the raw-score method. Corey observed 
that the grades given by competent judges who weight each test item 
differently will vary considerably from those grades assigned on the 
basis of raw scores. He concluded that the objectivity of raw-score 
weighting is spurious because some items are naturally more important 
than others. No information as to the reliability of the judges' 
ratings was presented. 

Because the conclusions reached by Corey (1930) disagreed with 
earlier evidence indicating that item weighting makes no difference, 
Odell (1931) conducted two studies similar in several respects to the 
earlier investigation by Corey. In the first study, Odell obtained 
six sets of weights for a 50-item four-choice test. Weights were 
determined by random assignment of weights to items in three of the six 
methods. Even when the "random weights" were used, the correlations 
between weighted and unweighted scores for 62 test papers ranged from 
.92 to .99. When weighted scores and scores corrected for guessing 
were correlated, the coefficients remained in the range of 
.98 to .99. In the second study, a 22-item true-false test was used. 
Weighting for three of the methods was determined by instructors. 
Correlations between weighted and unweighted scores ranged from .95 
to .98. Odell concluded that little is to be gained from weighting 
items in objective-type examinations, a conclusion at variance with 
that of Corey. 

Neither Corey nor Odell presented evidence of the reliability 
or validity of either weighted or unweighted scores. Further, no data 
were presented on the correlations among the sets of weights obtained 
from the judges. Odell did reveal, however, that some of the judges 
in Corey's study attached weights of zero to some items. 

A study by Potthoff and Barnett (1923) was concerned with the 
effects of the weighting of test items on the grades of individuals. 
Eleven methods were used to score a 100-item examination in high-school 
American history. Ten of the scoring methods were based on ratings by 
ten history instructors. One weighting method was the equal or 
unweighted system ordinarily used. Potthoff and Barnett were primarily 
concerned with the agreement between the weighted and unweighted scoring 
methods with regard to the assignment of grades based on test scores. 
The average agreement between all raters for all grading categories and 
the grade assigned by the unweighted method was 88 per cent. The authors 
cautioned the reader that, even when correlations between weighted and 
unweighted scores are high (.96), letter grades may still disagree 
considerably in some cases, especially in the middle (B-C) range. 
Potthoff and Barnett concluded that, for practical purposes, the 
differences between weighted and unweighted scores are generally so small that 
they can be disregarded and a great deal of labor can be saved by using 
the conventional, unweighted method of test scoring. 

Stalnaker (1938) considered the question of weighting as it 
affects the essay-type examination question. Citing several examples 
of weighting various College Entrance Examination Board essay questions, 
Stalnaker reported correlations between weighted and unweighted scores 
as being above .97 for tests in a variety of subject areas. Even when 
weights were assigned to items based upon the position of the item in 
the test, the obtained correlation between weighted and unweighted scores 
was .99. This indicated to Stalnaker that, because of the small net 
effect and the laboriousness of the weighting procedures employed, 
weighting of items is not extremely valuable. 

Although Stalnaker's paper provided no mathematical treatment of 
the effects of weighting test items, Wilks (1938) demonstrated the 
effects analytically. Wilks showed that, in a long test (50-100 items), 
when the item responses are positively intercorrelated, weighting items 
has little effect on the rank order of scores. In fact, when the 
number of items is large, the rank order of scores tends to become 
stable, or invariant, for different methods of obtaining linear scores. 

The foregoing review of the empirical studies of the effects of 
weighting test items leads to the general conclusion that it is not 
worth the trouble to apply the same weight to all choices in a 
multiple-choice item or to credit assigned for an essay question. And Wilks' 
analytical paper provides the mathematical rationale and proof of why 
this conclusion is warranted. This conclusion must not, however, be 
applied to the use of differential response-category weights. There 
is evidence that differential weighting of incorrect responses can be 
of considerable value for increasing test reliability. 



Differential Weighting of Item Response-Categories 

Empirical investigations of weighting response categories of 
test items differentially stem from work using interest and personality 
inventories. Some of the earlier work using this approach to item 
scoring was done by Strong (1943) and Kuder (1957). Both of these 
investigators have reported positive empirical evidence of the value 
of differentially weighting response categories of items in interest 
inventories. Their work, however, involved the weighting of response 
categories of questionnaire-type items with no correct answer. Weighting 
response categories of items with no correct answer is generally 
considered to be scaling and does not directly relate to this study. On the 
other hand, several different scaling techniques have been shown to be 
applicable to weighting response categories in aptitude-type tests. Of 
particular interest is a method proposed by Guttman (1941) for use in 
scoring interest inventories. Analytical and empirical evidence of the 
utility of differentially weighting response categories in aptitude and 
achievement tests is of particular importance to the present study. 

One of the earliest studies using a weighted-choice test-scoring 
procedure with an ability-type test was conducted by Staffelbach (1930). 
Using a sample of 244 eighth-grade students for whom both test data 
and criterion data (semester grade averages) were available, Staffelbach 
obtained raw-score regression coefficients for three scores on a 
60-item true-false test: number right, number wrong, and number omitted. 
The regression coefficients were .5017, -.5489, and .3559 for the 
rights, wrongs, and omits, respectively. Wrong responses were weighted 
slightly more heavily in the negative direction than were the right 
responses in the positive direction. Omits were assigned a positive 
weight. Thus, marking the correct response and recognizing inability 
to answer were both given positive weights in this system. 

Since the Staffelbach study involved a true-false test the 
differential weighting was not of incorrect responses but of incorrect 
as opposed to omitted responses. In this sense the weighting is 
similar to the now-common correction-for-guessing formula. In fact, 
the weights for right and wrong responses are quite similar in that 
they are approximately equal, but differ in sign. 

Kelley (1934) described a response-category weighting procedure 
that takes into account the item-criterion correlation when both 
variables are dichotomous. The formula presented by Kelley is 

$$ W = \frac{b_{2i}}{\sigma^2_{b_{2i}}} $$

where $W$ = the response weight; 

$b_{2i}$ = the regression coefficient of the criterion on the 
item; and 

$\sigma^2_{b_{2i}}$ = the variance of the regression coefficient. 

This procedure for weighting item choices or, actually, any responses 
that are dichotomous, was recommended by Kelley for use with interest- 
inventory items like those developed by Strong (1943). 

Guilford, Lovell, and Williams (1942) investigated the effects 
of differential response-category weighting on test reliability and 
predictive validity. The items for which response-category weights 
were obtained consisted of the first 100 (the first 101 minus one item 
known to be defective) items of a 308-item final examination in general 
psychology. A total test score was obtained by scoring all 307 items 
with a correction for chance success. The directions to the examinees 
did not state this fact, however. From 300 answer sheets drawn at 
random, data from the 100 sheets having the highest total scores and 
from the 100 sheets having the lowest total scores were used to obtain 
approximations to the per cent of the sample marking each category and 
to the phi coefficient between total test score (treated as a dichotomy) 
and the dichotomy of "mark" or "not mark" each response. These data 
provided the basis for the response-category weights as described by 
Guilford in an earlier study (1941). 

Reliability coefficients of scores based on weighted response 
categories and on the conventional scoring formula were obtained from 
a sample of 100 papers drawn from the 300 used to establish the 
category weights. Scores on odd and even items were obtained by both 
scoring procedures and the correlations of odd and even scores were 
corrected by the Spearman-Brown formula. The reliability coefficients 
for scores derived from the 100 items were .922 for the weighted scores 
and .899 for the unweighted scores. For the scores derived from the 
first 50 items, the analogous reliability coefficients were .860 and 
.844. Similar reliability coefficients for the first 20 items were 
.677 and .649. The statistical significance of the difference between 
each pair of reliability coefficients could not be tested or estimated 
without additional data. Thus, no conclusions about the statistical 
significance of the differences were reached. 
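
The odd-even procedure described above rests on the Spearman-Brown step-up, which estimates full-length reliability from the correlation between two half-tests. The sketch below shows the general formula and the odd-even split of one examinee's item scores; the function names and sample values are hypothetical and are not figures from the study under review.

```python
def spearman_brown(r_part, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the part.

    With factor=2 this is the usual split-half correction 2r / (1 + r).
    """
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)


def odd_even_totals(item_scores):
    """Split one examinee's item scores into odd-item and even-item totals.

    Items are numbered from 1, so index 0 is the first (odd) item.
    """
    odd_total = sum(item_scores[0::2])
    even_total = sum(item_scores[1::2])
    return odd_total, even_total


print(odd_even_totals([1, 0, 1, 1, 0, 1]))  # (2, 2)
print(spearman_brown(0.5))                  # 0.666...
```

In use, the odd and even totals would be computed for every examinee, correlated across examinees, and that half-test correlation passed to spearman_brown.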

Any comparison of the difference between the reliability 
coefficients in each pair must take into account the fact that the 
100 answer sheets used to compute the reliability coefficients for the 
weighted scores were drawn from the same sample on which the weights 
were established. That this procedure leads to spuriously high 
reliability coefficients must be considered a serious possibility. 
Even with this in mind the data suggest that the use of response- 
category weights of the type used by these investigators provided 
scores little more reliable than those obtained through conventional 
scoring procedures. 

It is quite possible that the items themselves were of a 
nature that did not encourage the use of partial information for 
marking choices among distracters. Also, the items may have been
easy, thus making the use of differential response-category weights
less likely to contribute reliable information to the test scores. 

Several investigators (Coombs, Milholland, & Womer, 1956;
Dressel & Schmid, 1953; Hawver, 1969) have presented scoring proce- 
dures that attempt to assess partial knowledge available to an 
examinee. 

The Dressel and Schmid study (1953) was among the first to
investigate modified multiple-choice items to determine whether they 
could be made to be more discriminating. Five groups of approximately 
90 college students each first received a "standard test." This
standard test was used to determine the equality of the groups. 
Three of the groups then took a single 44-item multiple-choice test 
but with differing instructions on how to respond to each item. The 
first group received instructions that the score was to be number 
right. The second group received instructions to mark as many choices 
per item as necessary to ensure marking the correct choice. This
"free-choice test" was believed to take the student's certainty of response
into account. A third group was asked to indicate certainty of
response to each item by assigning a number from a 4-point "certainty
scale." This was termed the "degree-of-certainty test." A fourth
group took a modified version of the 44-item test with the choices in
each item changed so that more than one choice could be correct. The
students were informed of this fact. This "multiple-answer test" was
designed to compel the students to assess each item more thoroughly. 
Finally, the fifth group took a modified version of the 44-item test 
with the choices changed so that there were two correct choices per 
item. Examinees were informed that the scores would equal the number 
of items marked correctly. 

Comparing the reliabilities and validities of the tests on 
which the five special scoring methods were used with those of the 
standard test, Dressel and Schmid reported no significant differences. 

Coombs, Milholland, and Womer (1956) presented reliability
coefficients of three 40-item tests that had been administered and 
scored conventionally and in such a way as to incorporate the effect of 
using partial information in marking test items. The reliability 
coefficients for the conventional and special procedures, respectively, 
were .72 and .73 for a vocabulary test, .64 and .70 for a driver- 
information test, and .89 and .91 for an object-aperture test. The 
statistical significance of the difference between reliability coefficients
in each of the pairs of coefficients could not be obtained
since the coefficients were obtained by Kuder-Richardson formula no.
20.

The authors provided data showing that the examinees used 
partial information in answering items in the vocabulary and driver- 
information tests. Their analysis of responses of examinees to diffi- 
cult and easy items provides a statistically significant confirmation 
of the expectation that the reliability of a test composed of diffi- 
cult items is more likely to be increased by the use of response- 
category weights in scoring than the reliability of a test made up of 
easy items. 

Nedelsky (1954a) described a method by which the choices in
multiple-choice items could be classified into three general categories.
Instructors classified responses as R responses or right answers,
F responses or wrong responses that would have appeal only to the poorest
students, and W responses, wrong responses other than F responses.
Another paper by Nedelsky (1954b) was concerned with the uses of the
F score made by an examinee and the number of F responses chosen in a
multiple-choice test. The properties of the F score were studied alone
as well as in combination with the R score. The composite (C) score
resulting from this combination superficially resembles the common
"formula-scoring" procedure that provides a penalty for guessing. The
score C is defined as:

C = R - F/f

where C is the composite score;

R is the "rights" score;

F is the F score (number of F responses chosen); and

f is the average number of F responses per item in the test.
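
The definition above can be illustrated with a minimal Python sketch. The function name and the numbers in the usage line are hypothetical and are not taken from Nedelsky's data; the sketch only restates C = R - F/f.

```python
def composite_score(rights, f_chosen, f_per_item):
    """Return the composite score C = R - F/f.

    rights      -- R, the number of right answers marked
    f_chosen    -- F, the number of F responses the examinee chose
    f_per_item  -- f, the average number of F responses per item in the test
    """
    if f_per_item <= 0:
        raise ValueError("f must be positive: the test needs some F responses")
    return rights - f_chosen / f_per_item


# Hypothetical examinee: 60 rights, 10 F responses, 2 F responses per item on average.
print(composite_score(60, 10, 2.0))
```

The penalty term shrinks as f grows, so a test with many F responses per item penalizes each chosen F response less.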

In this study, Nedelsky analyzed a 113-item multiple-choice test
given to 306 students completing a course in the physical sciences.
Grades for the students were determined on the basis of the R scores.
The "experimental group" contained all students receiving a grade of D
or F and a representative sample of those who received higher grades.
Kuder-Richardson reliability estimates were calculated for R, F, and C
scores for the A, B, and C students, for the D and F students, and for
the total group. These coefficients indicated that the R score had a
negative reliability for the D and F students. The F-score reliability
of .42 was the highest obtained for this group, the C-score reliability
being .26. Interestingly, the C-score reliability calculated for the
A, B, and C group and the total group exceeded the R-score reliability
by at least .02 in the first case and .03 in the second. It was noted,
however, that only 70 of the 113 items in the test had any F responses
in them.

Over-all, the C score was considered to be the most reliable
score calculated from the data on this sample of examinees. Nedelsky
posits that the F score "...furnishes evidence of the existence of an
identifiable ability to avoid gross error in a given field and of
considerable differences in this ability among the poorest students of
a class (p. 464)."

Merwin (1959) provided a detailed theoretical analysis of six
methods of scoring three-choice items. Methods using two, three, and
six response patterns were considered in conjunction with consecutive
integer weights and weights which maximize the correlation of item
scores with the criterion. If each subject is instructed to rank the
three choices in an item according to their attractiveness, there are
only six different response patterns available. Thus, six different
scores can be assigned to a single item. For example, the permuted
response patterns of "abc," "acb," and "cab," etc. are assigned different
weights. In the three-score paradigm, only the rank of the correct
alternative is considered. And in the two-score scheme, the subjects
merely indicate their first choice. This third method is the common
"rights-only" method of scoring.
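
The three schemes can be made concrete with a short Python sketch, offered only as an illustration. The function names are hypothetical, and a ranking is represented as a string such as "cab" (most attractive choice first); none of this notation comes from Merwin's paper.

```python
from itertools import permutations


def response_patterns(choices="abc"):
    """All rankings of the choices: the six-score scheme has one score per pattern."""
    return ["".join(p) for p in permutations(choices)]


def rank_of_correct(pattern, correct):
    """Three-score scheme: only the rank (1, 2, or 3) of the correct choice matters."""
    return pattern.index(correct) + 1


def first_choice_correct(pattern, correct):
    """Two-score scheme ("rights only"): was the correct choice ranked first?"""
    return pattern[0] == correct


patterns = response_patterns()
print(patterns)
print([rank_of_correct(p, "a") for p in patterns])
print([first_choice_correct(p, "a") for p in patterns])
```

Each coarser scheme collapses patterns of the finer one: six rankings reduce to three ranks of the correct choice, and those reduce to right or wrong.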

Weighting of response patterns was accomplished through either
integer weights or weights proportional to the mean criterion score
for subjects choosing the particular response pattern. The weights in
the latter case were identical in kind to those described by Guttman
(1941). Each scoring and weighting combination was studied by systematically
varying the item parameters and studying the effects on the
"efficiency indexes." What Merwin termed the "efficiency index" is
actually the product-moment correlation coefficient between item scores
and a specified criterion. Merwin summarized his theoretical study by
saying that the use of the six-score scheme, in combination with the
Guttman-type weights, will always yield item validity efficiency as high
or higher than any other method studied. Merwin also pointed out,
however, that the increases are relatively small and would be smaller
after cross-validating the obtained response weights. For efficiency
and ease of scoring, the "best" method studied was that using three
integer weights, +1, 0, and -1 with the three-score scheme.

The two papers that follow are considered in much greater detail
than others included in this review because of their direct relation to
the present investigation. The article by Davis and Fifer (1959) presents
empirical evidence of the value of response-category weighting of
the kind used in the present study. The second article by Davis (1959)
describes analytically choice-weighting procedures that he recommends.

Davis and Fifer (1959) investigated the effects of response-category
weighting of multiple-choice items on the reliability and
validity of an achievement test. From approximately 300 arithmetic-reasoning
items constructed especially for this study, two matched sets
of 45 items were chosen. In addition, two matched sets of 5 items
testing computational skills were also constructed and included in the
arithmetic-reasoning tests. These "computational" items, when scored
appropriately, served to cancel some variance in the test scores that
might be attributed to computational facility and not arithmetic
reasoning.

Two mathematicians, working independently, assigned weights to
each choice in the two 45-item tests. These weights were on a seven-point
scale from -3 to +3. These two sets of weights were then reconciled
to obtain one set of a-priori weights for all choices in the two
tests (5022 and 5023 were the test labels). This same procedure was
carried out for the two sets of five "computational" items. The signs
of the weights for the "computational" items were reversed, however,
to make them serve as a "suppressor" variable for computational facility.

Both tests (5022 and 5023) were administered to a sample of over
1000 airmen at Lackland Air Force Base. From this initial group, answer
sheets of a subsample of 370 airmen were drawn at random and scored
using the a-priori weights. Empirical weights, expressed as biserial
correlations between total test score and marking or not marking a
choice, were calculated for each choice. The empirical weights were
then modified so that no wrong answer was allowed to have a scoring
weight higher than that of the correct answer to the item of which it
was a part.
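
The modification step can be sketched in Python as follows. The review does not say how Davis and Fifer adjusted an offending weight, so lowering it to the weight of the correct answer is an assumption made here purely for illustration; the function name and the sample weights are hypothetical.

```python
def cap_wrong_weights(weights, correct):
    """Ensure no wrong choice carries a weight above that of the correct choice.

    weights -- dict mapping each choice label to its empirical weight
    correct -- label of the keyed (correct) choice
    Assumption: an offending weight is lowered to the correct answer's weight.
    """
    ceiling = weights[correct]
    return {
        choice: (w if choice == correct else min(w, ceiling))
        for choice, w in weights.items()
    }


# Hypothetical item in which distracter "B" correlates more highly than the key "A".
print(cap_wrong_weights({"A": 0.30, "B": 0.45, "C": -0.10}, "A"))
```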

The remainder of the sample from which the 370 cases had been
drawn at random was used to test the effect of these differential
response-category weights on the reliability and validity of the two
tests. Four scores were obtained for each examinee in this sample.
They were: 1) number correct on test 5022; 2) number correct on test
5023; 3) the sum of the choice weights for choices marked on test
5022; and 4) the sum of the choice weights for the choices marked on
test 5023.

After the raw scores had been converted to normalized standard
scores, a parallel-forms reliability coefficient for the unweighted
scores on tests 5022 and 5023 was calculated by correlating the
"number-rights" scores on these tests. The obtained coefficient was .6836.
By correlating the empirically modified weighted test scores for forms
5022 and 5023, a parallel-forms reliability coefficient of .7632 was
obtained. After these r's had been converted to Fisher's z values,
the difference in z's was found to be statistically significant
(p<.001). Davis and Fifer noted that this increase in reliability
would have been obtained if the tests had been scored "number right"
only after their lengths had been increased by 50 per cent.
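
The 50 per cent figure can be checked with the general Spearman-Brown formula. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the lengthened test is built from parallel material, which is what the formula requires.

```python
def spearman_brown(r, n=2.0):
    """Reliability of a test lengthened n-fold, given its current reliability r."""
    return n * r / (1 + (n - 1) * r)


def length_factor(r_old, r_new):
    """Factor by which a test must be lengthened to raise reliability r_old to r_new.

    Obtained by solving the Spearman-Brown formula for n.
    """
    return r_new * (1 - r_old) / (r_old * (1 - r_new))


# Coefficients reported by Davis and Fifer (1959): .6836 unweighted, .7632 weighted.
n = length_factor(0.6836, 0.7632)
print(round(n, 2))                          # about 1.49, i.e. roughly 50 per cent longer
print(round(spearman_brown(0.6836, n), 4))  # recovers the weighted coefficient
```

With n = 2 the same function gives the step-up from a half-test correlation to a full-length split-half reliability, the correction applied to odd-even correlations earlier in this review.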

Two criterion measures were used in assessing increases in
validity due to choice weighting of these two tests. One criterion
consisted of teachers' ratings of pupils' abilities to solve
arithmetic-reasoning problems. The second consisted of scores on a
free-response version of items in either 5022 or 5023. A sample of 251
high-school students was divided into four groups. Each group received a
free-response version of either 5022 or 5023 and a multiple-choice
version of 5022 or 5023. Administration of the different forms was
counter-balanced in the four groups to guard against testing-sequence
effects. The two groups receiving the multiple-choice version of 5022
were combined, as were the two groups receiving the multiple-choice
version of 5023. Validity coefficients were obtained between the
multiple-choice tests (scored by the two methods), teacher ratings,
and the free-response versions of 5022 and 5023. The two coefficients
between the teachers' ratings and the multiple-choice tests scored
"rights only" and by empirical weights were .39 and .42, respectively.
The coefficients between the multiple-choice tests scored both ways
and the free-response test were .69 and .68, respectively. Neither of
these differences approached statistical significance. Davis and
Fifer concluded that significant increases in test reliability can be
gained without reducing test validity by using weights for each choice
of a well-constructed test.



Davis (1959) is more explicit about a method of estimating
choice weights that he recommends for practical use. The procedure for
obtaining choice weights that tends to maximize the correlation of any
set of items with any given criterion is quite similar to that described
by Guttman (1941). Guttman's procedure entails the calculation of the
mean criterion score for the group of examinees that select each choice
in every multiple-choice item. The actual choice weights are proportional
to these mean criterion scores. As Davis pointed out, this
procedure would be extremely laborious without the use of high-speed
computers. An alternative, short-cut procedure suggested by Flanagan
(1939) was used by Davis. This method provides approximations to the
Guttman weights by simply reading them from a table published by Davis
(1966).

The Flanagan-Davis procedure entails the estimation of the
weight for choice k of item i, defined as the mean criterion
standard score for the group of examinees who marked choice k of
item i. To estimate this weight, the correlation between
item-choice standard scores, z_ik, and the criterion standard scores,
z_c, can be read from a table devised by Flanagan (Flanagan and Davis,
1950) if the per cents of examinees in the upper and in the lower 27%
of the criterion distribution who selected the choice are known. Since
z_ik for item i can be read as the normal deviate corresponding to p_ik,
the per cent who mark the choice, the weight can then be estimated from
the regression equation. Davis (1966) provided a table for this purpose.

Davis determined the accuracy of this estimation procedure by
actually calculating the mean criterion standard scores for examinees
responding to each item choice in a 45-item arithmetic-reasoning test.
The estimation procedure was also carried out for each item choice in
the same test. The obtained correlation between computed means and
estimated means for the 45 correct choices was .9^. For the 180
distracters the correlation was .91. These correlations and the close
similarity of the means and standard deviations of the sets of weights
showed that the estimation procedure is highly satisfactory.

To assess the reliability of weights estimated by this procedure,
Davis obtained two samples of 370 examinees who took tests 5022 and
5023, the parallel-forms arithmetic-reasoning test used in Davis and
Fifer's investigation (1959). Choice weights for both tests were estimated
for the two samples. A correlation between the weights for the
two tests estimated in two independent samples constituted a reliability
coefficient for the weights. Davis found the reliability
coefficient of the correct response weights to be .64 and, of the
distracters, to be .67. These coefficients are significantly different
from zero, are moderately high, and could be increased by using larger
samples for establishing the weights.

More recently, Sabers and White (1969) reported an empirical
study of the scoring procedure previously described by Davis and Fifer
(1959) and Davis (1959). These investigators used four groups of
examinees, two groups enrolled in a modern mathematics program and two
groups enrolled in a traditional algebra course. All choices on the
Iowa Algebra Test were weighted using a chart devised for that purpose
by Davis (1966). The criterion measures were 40-item multiple-choice
tests scored number correct. Sabers and White cross-validated the
weights by applying the weights derived on one group to the other
group in the same mathematics category. Non-significant increases in
reliability and validity were reported.

The main focus of an investigation by Hendrickson (1971) was 
to determine the effects of choice (response-category) weighting on 
the internal-consistency reliability of four subtests of the Scholastic
Aptitude Test (SAT). The effects of the weighting scheme on the
intercorrelations of the subtests and the regression of scores derived 
from Guttman weighting on those obtained through the conventional 
formula-scoring method were also investigated. 

The first study by Hendrickson compared the internal consistency 
reliability coefficients of four subtests of the SAT when they were 
scored with the conventional correction for chance success and with 
cross-validated Guttman weights. Comparisons for male and female
examinees were treated separately to ascertain any sex-related differ- 
ences in the effects of choice weighting. 

The effective increase in test length varied from subtest to
subtest and between sex groups but was no less than 19%. That is to
say, a subtest could be reduced in length by almost 20% and, if
scored using Guttman weights, would have the same internal consistency 
as the longer test. Overall, the average effective increase in test 
efficiency was 49%. Thus, the use of Guttman weights could save 
considerable testing time without loss of reliability. As Hendrickson 
points out, the Guttman weighting scheme depends upon the correctness 
of the assumptions that (a) the quality of response categories differs, 
and (b) that groups of similar levels of knowledge about the point 
being tested tend to choose the same category. 

Another part of the investigation revealed a significant linear 
relationship between Guttman and formula-score distributions. Inspec- 
tion of the plot of the regression of Guttman scores on formula scores 
showed greater dispersion of Guttman scores at lower values of formula 
scores. This was taken to indicate that Guttman weighting affects
low-scoring examinees more than high-scoring ones. Nedelsky (1954b)
demonstrated a similar effect using another weighting scheme.

A comparison of the response-category weights for men and women 
indicated that, when the weights derived for each sex were interchanged, 
the distribution of total scores was essentially unaltered. Hendrickson 
did, however, indicate that while the sexes did not respond differently 
to the items as a whole, they did respond differently to the choices. 
It was suggested that this may be a neglected source of bias in testing 
procedures that is deserving of attention. 

In sum, Hendrickson found that Guttman weighting resulted in
improved internal-consistency reliability for certain subtests of the
SAT. The effects were more pronounced for the verbal subtests, but the 
weighting procedure also was beneficial in the quantitative subtests. 
As expected, a linear relationship was found between scores derived 
from Guttman weights and those derived through conventional formula 
scoring methods.

Reilly and Jackson (1972) conducted an investigation quite 
similar in many ways to the present one. They attempted to provide 
additional evidence of the value of empirical choice weighting in 
improving the internal-consistency reliability, parallel-forms
reliability and validity of a high-level aptitude test, the Graduate Record
Aptitude Examination (GRE).

Three types of scoring procedures were employed. One was the 
conventional formula scoring. A second involved weighting item-response
categories by assigning the mean standard score on the remaining items 
for all persons marking that choice. This second procedure is essen- 
tially the one employed by Hendrickson (1971). The third weighting 
procedure involved assigning to each option in an item a weight which 
was the mean standard score on the corresponding parallel-form of all 
persons choosing that option. 

Cross-validated weighting procedures on the sub-forms of the 
GRE revealed substantial increases in both internal-consistency 
reliability and parallel-forms reliability. The increases in both
types of reliability follow a similar pattern with increases in 
effective changes in test length ranging from one and one-half times 
to more than twice the original length for the verbal sub-forms of the 
GRE. 

The effects on improving test validity were less impressive. 
Using a sample different from those used to obtain the empirical 
weights, weighted and unweighted GRE scores were used to predict 
grade-point average (GPA) for over 4,000 college students. The 
weighted scores produced a multiple R .05 less, on the average, than 
the conventional formula score. Thus, empirical choice weighting to 
improve reliability did not lead to improved predictive validity for 
the GRE verbal or quantitative scores. 

Item response-category weighting, when the weights are based 
upon procedures similar to that described by Guttman (1941), may lead
to improved internal-consistency and parallel-forms reliability when 
the appropriate criterion is employed. This has been shown by Davis 
and Fifer (1959), Hendrickson (1971), and Reilly and Jackson (1972). 

Success in improving predictive validity when a test is weighted 
to increase reliability has been illusory. Davis and Fifer (1959) 
obtained no significant change in the validity of a mathematics- 
reasoning test by using differential choice weights. Reilly and 
Jackson (1972) obtained lower validity coefficients with choice-weighted
scoring than with the conventional scoring with correction for chance
success.

The empirically derived weight for the "omit" category for an 
item has been discussed recently by several authors (Green, 1972; 
Hendrickson, 1971; Reilly & Jackson, 1972). Although Green admits that
reliability can be improved by using Guttman weights, he presents
arguments against the use of such weights for one reason: the Guttman
weight for omission of an item usually penalizes the examinee severely.
In his investigation, Green found that, in general, people who omit
items obtain lower scores on a test than those who guess when in doubt 
about the correct alternative. Because test directions often caution 
examinees about guessing, Green is of the opinion that it is unethical 
to use Guttman weights for scoring. 

Hendrickson (1971) suggested that weighting the distracters and 
omit categories had more effect on scores than weighting the correct 
category. Like Green (1972), Hendrickson found that examinees who 
tended to omit items also tended to score lower on the test as a whole 
than examinees who mark incorrect categories. Gains in internal con- 
sistency or parallel-forms reliability seem to be due to the effects 
of weighting on low-scoring examinees. Since low-scoring examinees 
tend to mark more distracters and omit more items than high-scoring
examinees, the effects of Guttman weighting are more strongly felt by
those at the low end of the score distribution. 

The weights for the omit categories for the GRE test items used 
by Reilly and Jackson (1972) were not what the investigators expected. 
Examinees were given a bonus for not responding to some of the verbal 
items. For the quantitative items examinees always received a penalty 
for omitting an item. The investigators, like Slakter (1967), 
suggested that, while the propensity to omit items is reliable, it is 
not valid for predicting some external criterion. This was offered 
as an explanation for the decreases in validity in spite of the 
increases in reliability.

It may be concluded from the recent work of Hendrickson (1971)
and Reilly and Jackson (1972) that increases in reliability can be 
attributed primarily to the differential weighting of distracter and 
omit categories. In particular, weighting of the omit category seems 
to provide these increases because omitting items is a characteristic 
of certain examinees and the effects of this characteristic are reliable.
However, as Green (1972) points out, instructions for multiple-choice
examinations where a correction-for-guessing formula is used
regularly caution the examinee about wild guessing. The implication
for the examinee is to omit when in doubt. Those examinees who omit 
items tend to be penalized for following directions. It would seem, 
then, that either the directions for test taking should be changed or 
the category weight for omitting an item should penalize the examinee 
less. It seems that the examinee who is aware of what he does not 
know should not be penalized more than the examinee who is not aware
of what he does not know and selects incorrect answers. 



Summary 

It has been found experimentally that weighting the correct 
responses to some items in a test more than others usually has no 
appreciable effect on test reliability or validity. The mathematical 
explanation of this finding has been provided by Wilks (1938). 

On the other hand, the differential weighting of all choices 
in each item in a test can have a marked effect on test reliability. 
As Davis and Fifer (1959) indicated, the differences among the 
weights assigned to the incorrect choices in an item mainly account 
for this effect. 

The results of differential response-category weighting on 
test validity depend on the criterion variable used for establishing 
the weights. It is possible that a set of weights capable of increas- 
ing test reliability may decrease test validity for specified criteria. 
The extent to which this happens in practice is not yet clear. 


CHAPTER III



THE RELIABILITY STUDY 



Purpose 

The purpose of the reliability study in this investigation of 
the effects of differential choice weighting on test reliability and 
validity was to compare the parallel-forms reliability coefficients 
of Forms C and D of the Davis Experimental Reading Tests (Davis, 
1968), when scores were obtained by four different methods. 



Tests Used 

The nature and development of the Davis Experimental Reading 
Tests, Forms C and D, were described in detail by Davis (1968). 
Twelve items testing each of eight basic reading skills comprised 
each form of the test. Each item was made up of a stem and four choices. 
For additional information about these tests, the reader is referred to 
the article cited. 



Samples 

Davis (1968) administered his Experimental Reading Tests in the 
fall of 1966 to approximately 1,100 twelfth-grade pupils in academic 
high schools in the suburbs of Philadelphia. Since the tests were
designed to measure several aspects of comprehension in reading, time 
was allowed for every pupil to try every item at each of two testing 
sessions and schools drawing largely from middle-class and upper- 
class homes were used. These procedures minimized the effects of the 
mechanics of reading on the test scores. 

From Davis's basic list of examinees, three groups were drawn
at random without replacement. Within the first group, two samples
(denoted 1R-C and 1R-D in Table 1) were formed, consisting of 330
examinees who took Form C and whose corresponding answer sheets for
Form D were identified in the group and 331 examinees who took Form D
and whose corresponding answer sheets for Form C were identified in
the group.

Within the second group, two samples (denoted 2R-C and 2R-D in
Table 1) were formed, consisting of 328 examinees who took Form C and
whose corresponding answer sheets for Form D were identified in the 
group and 331 examinees who took Form D and whose corresponding answer 
sheets for Form C were identified in the group. 

The third group was made up of 360 examinees for whom answer
sheets for both Forms C and D were available. This sample is denoted
3R in Table 1. Table 2 provides descriptive statistics pertaining to
all of the samples used in the reliability study.



Scores To Be Compared 

The four methods for obtaining scores to be used in obtaining 
parallel-forms reliability coefficients for Tests C and D in Sample 
3R are as follows: 

W1: For each item, examinees were credited with 1 point for
a correct response, 0 for an incorrect response, and 0 for omission 
(failure to mark any choice as correct after reading the item). The 
total test score consisted of the sum of the item scores in it. This 
is commonly called "number-right scoring." 

W2: For each item, examinees were credited with 1 point for a 
correct response, 0 for omission, and -1/(k-1) for an incorrect response
(where k represents the number of choices per item). This is commonly 
called "formula scoring," and embodies a correction for chance success. 
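
Methods W1 and W2 can be sketched in a few lines of Python. The function names, the use of None to represent an omitted item, and the sample answer key are assumptions made for illustration and do not come from the study's own scoring program.

```python
def score_w1(responses, key):
    """Number-right score: 1 for a correct response, 0 for wrong or omitted."""
    return sum(1 for r, correct in zip(responses, key) if r == correct)


def score_w2(responses, key, k=4):
    """Formula score: 1 for correct, 0 for omission, -1/(k-1) for incorrect.

    k is the number of choices per item (4 for the items described above).
    An omitted item is represented by None.
    """
    total = 0.0
    for r, correct in zip(responses, key):
        if r is None:
            continue  # omission contributes 0
        total += 1.0 if r == correct else -1.0 / (k - 1)
    return total


# Hypothetical six-item example: 3 right, 2 wrong, 1 omitted.
key = ["A", "B", "C", "D", "A", "B"]
responses = ["A", "B", "D", None, "A", "C"]
print(score_w1(responses, key))            # 3
print(round(score_w2(responses, key), 3))  # 3 - 2/3
```

Under W2 an examinee who guesses blindly among k choices has an expected item score of zero, which is the sense in which the method corrects for chance success.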

W3: For each item, examinees were credited with scores based 
on weights assigned to each choice and to the response category of 
omission (failure to mark any choice as correct after reading the item). 
Each scoring weight was made proportional to the mean criterion score 
for examinees who fell in a given response category. The criterion 
scores for establishing scoring weights for Form C were total scores 
obtained on Form D by method W2 in Sample 1R-C. The criterion scores
for establishing scoring weights for Form D were total scores obtained
on Form C by method W2 in Sample 1R-D. The total scores obtained by
method W3 consisted of the algebraic sum of the scoring weights for 
the 96 response categories (one per item) selected by each examinee on 
Form C or Form D. 

W4: For each item, the examinees were credited with scores based 
on modified scoring weights assigned to each choice and to the response 
category of omission. Each of the scoring weights obtained by method 
W3 was "modified" by multiplying it by the partial regression coefficient 
that would maximize the multiple correlation between a set of linear 
composites of the 96 item scores in Form C (or in Form D) and a set of 
specified criterion scores. For Form C, the criterion scores consisted 
of total scores on Form D obtained by method W2 in Sample 2R-C. For 
Form D, the criterion scores consisted of total scores on Form C 
obtained by method W2 in Sample 2R-D.

Table 1

Numbers of Examinees in
Validation and Cross-Validation Samples 1R, 2R, and 3R
for the Reliability Study

              Test Form
Sample        C       D

1R          330     331
2R          328     331
3R          360     360



Table 2

Descriptive Statistics on the Criterion Variables*
for All Samples
Reliability Study

Form C

                          Sample
Statistics         1R         2R         3R

N                 330        328        360
Mean           55.^93     55.229    1:^.202
Variance      453.253    437.391    417.217
St. Dev.       21.290     20.914     20.426
Range          87.670     92.000     93.330
Skewness       -0.666     -0.63^     -0.628
Kurtosis       -0.551    -0.'i?6     -0.330

Form D

                          Sample
Statistics         1R         2R         3R

N                 331        331        360
Mean                      54.282     54.605
Variance      439.354    366.103    336.061
St. Dev.       20.961     19.134     18.332
Range         8^1.000     90.340     89.333
Skewness       -0.kk3     -0.499     -0.410
Kurtosis       -0.7^7     -0.594     -0.331

*Note. The criterion variable differed for the
groups. Depending upon the group, the criterion was
either Form C score or Form D score.


Determination of Scoring Weights for Method W3



Guttman (1941) showed that the correlation ratio between item 
scores and a set of criterion scores could be maximized by scoring an 
item with a?. weight for each choice proportional to the mean criterion 
score of examinees who marked that choice. In this study, his procedure 
has been broadened from use with questionnaires to use with multiple- 
choice items of any kind and from its use to obtain scoring weights for 
item choices to use for obtaining scoring weights for other response 
categories 9 such as omission (failure to mark any choice as correct 
after reading the item) or failure to mark any choice because lack of 
time did not permit reading the item. 

To obtain W3 scoring weights for each of the possible five 
response categories for each item in Test C, the answer sheets for 
330 examinees vuo made up Sample IR-C were used. Their raw scores 
(after correction for chance success) on Test D were obtained. These 
corrected raw scores were then converted to normalized standard scores 
with a mean of 50.000 and a standard deviation of 21.066. These served 
as criterion total scores. 
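
The conversion to normalized standard scores described above can be 
illustrated with a minimal sketch. The report does not state which 
percentile convention was used, so the midpoint percentile rank below is 
an assumption, and the function name is hypothetical. 

```python
from statistics import NormalDist

def normalized_standard_scores(raw_scores, mean=50.0, sd=21.066):
    """Map raw scores to normal deviates via percentile ranks, then rescale.

    Assumption (not from the report): each score's percentile rank is the
    proportion of scores below it plus half the proportion tied with it.
    """
    n = len(raw_scores)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    converted = []
    for x in raw_scores:
        below = sum(1 for v in raw_scores if v < x)
        tied = sum(1 for v in raw_scores if v == x)
        p = (below + 0.5 * tied) / n  # always strictly between 0 and 1
        converted.append(mean + sd * standard_normal.inv_cdf(p))
    return converted
```

With this convention the median raw score maps to the target mean, and 
scores equally far from the median in rank map to values symmetric 
about it. 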

Next, the mean criterion total score on Form D of those examinees 
who fell in each response category for each of the 96 items in Test C 
was calculated. These means were then transformed linearly so that, 
within each item, the sum of the products of each transformed mean and 
the number of examinees entering its calculation was made equal to zero. 
This constraint was suggested by Guttman (1941). The transformed mean 
criterion score for each item response category was used as the weight 
in method W3. 
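
As a rough illustration of the weighting step just described, the sketch 
below computes, for one item, the mean criterion score in each response 
category and shifts the means so that the frequency-weighted sum of the 
weights is zero. The function name and the optional `scale` factor are 
assumptions introduced here; the report does not give the constants of 
its linear transformation. 

```python
from collections import defaultdict

def response_category_weights(responses, criterion, scale=1.0):
    """Return (weights, counts) for one item.

    responses: the category chosen by each examinee (e.g. "A", "Omit")
    criterion: the criterion score of each examinee, in the same order
    scale:     hypothetical multiplier; any positive value preserves
               the zero-sum constraint
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, score in zip(responses, criterion):
        totals[category] += score
        counts[category] += 1
    grand_mean = sum(criterion) / len(criterion)
    # Subtracting the grand mean makes sum(count * weight) equal zero.
    weights = {c: scale * (totals[c] / counts[c] - grand_mean) for c in counts}
    return weights, dict(counts)
```

An examinee's W3 total score would then be the algebraic sum, over 
items, of the weight attached to the category he or she selected. 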

Analogous scoring weights were then obtained for each of the 96 
items in Form D by using Sample 1R-D. The W3 response-category weights 
for Test C are shown in Table 3 and the numbers of examinees on which 
they are based are shown in Table 4. Analogous data for Test D are 
shown in Tables 5 and 6. 

It should be noted that these scoring weights based on Samples 
1R-C and 1R-D were free from spurious inflation because the criterion 
scores for the weights established for Test C came from Test D and the 
criterion scores for the weights established for Test D came from Test 
C. 



Determination of Scoring Weights for Method W4 

It is well known that the best linear combination of variables 
for predicting a specified criterion variable can be obtained by using 
partial regression coefficients to weight each predictor variable. The 
method used to obtain W4 weights in this study treats each of the 96 
items in Test C, scored by W3 weights established in Sample 1R-C, as a 
variable for predicting total scores on Test D obtained by applying the 
Table 3 

Response-Category Weights for Each of the 96 Items in 
Form C of the Experimental Reading Test, 
Sample 1R 
(N = 330) 

Item          Response Category: A, B, C, D, Omit 

[The W3 weights for items 1-96 are not legible in this copy.] 
Table 4 

Frequency of Response to Each Response Category, 
Form C of the Experimental Reading Test, 
Sample 1R 

Item          Response Category: A, B, C, D, Omit 

[The response frequencies for items 1-96 are not legible in this copy.] 
Table 5 

Response-Category Weights for Each of the 96 Items in 
Form D of the Experimental Reading Test, 
Sample 1R 

Item          Response Category: A, B, C, D, Omit 

[The W3 weights for items 1-96 cannot be assigned to items and 
categories reliably in this copy.] 
Table 6 

Frequency of Response to Each Response Category, 
Form D of the Experimental Reading Test, 
Sample 1R 
(N = 331) 

                 Response Category 
Item       A      B      C      D   Omit 

  1       17      4     41    269      0 
  2       22     23    280      6      0 
  3      220      8     76     27      0 
  4       32     30    231     37      1 
  5       33     16     89    192      1 
  6       96     56    152     24      3 
  7       75    155     43     57      1 
  8      156    116     22     37      0 
  9      143     31    106     48      3 
 10      111     82     47     88      3 
 11       31    110     79    107      4 
 12       34    215     73      9      0 
 13        1    316      6      8      0 
 14      290     11     27      3      0 
 15       18    264     37     12      0 
 16       27     12      1    290      1 
 17      214     83     13     21      0 
 18       27      6     18    280      0 
 19       53    228     38     11      1 
 20      196     53     65     16      1 
 21       19    241     58     13      0 
 22      178     22     50     81      0 
 23       68     39     14    209      1 
 24       54    137     21    119      0 
 25      309     14      1      7      0 
 26       13    279     39      0      0 
 27      314      2      6      9      0 
 28       12     22     11    286      0 
 29      241     33     37     20      0 
 30       14     19    294      4      0 
 31       13     63     10    245      0 
 32       97     10      7    217      0 
 33       23     28    268     12      0 
 34      149     14     22    146      0 
 35       26     23    252     29      1 
 36      194     11     59     67      0 
 37        5      2     36    288      0 
 38      286     34      9      2      0 
 39       55    178     31     67      0 
 40        6    281     10     33      1 
 41      243      6     24     58      0 
 42        7     35     28    261      0 
 43       33     21     28    248      1 
 44       38     12    261     20      0 
 45       27     18    245     41      0 
 46        6    221     22     81      1 
 47       23    223     22     62      1 
 48       18     23    140    150      0 
 49        3      ?      1    321      ? 
 50      245      ?      ?      ?      ? 
 51      307     15      5      4      0 
 52        8      ?      ?      ?      ? 
 53        1      ?      ?     20      0 
 54       21      ?      ?      ?      1 
 55       33      ?      ?     20      0 
 56        8      ?      ?      ?      ? 
 57       25      ?      ?     29      2 
 58        8      7      ?      ?      0 
 59       18      ?      ?    246      1 
 60       62      ?    174     43      0 
 61       14      ?      ?     20      1 
 62       18      ?      ?     17      0 
 63       61      ?     11     19      ? 
 64       29      ?      ?    236      0 
 65       10      ?      ?     42      1 
 66       56      ?      ?    112      0 
 67       57      ?      ?    197      ? 
 68       23      ?      ?     85      ? 
 69      168      ?      ?    144      0 
 70      176      ?      ?     39      0 
 71       69      ?      ?     56      0 
 72      147      ?      ?    135      0 
 73       19      ?    251     41      ? 
 74      248      ?     46      8      1 
 75        8      ?     60     27      1 
 76      262      ?      7      7      0 
 77      232      ?      ?     32      2 
 78       46      ?     15     62      0 
 79       44      ?      ?     53      0 
 80       38      ?     20    225      3 
 81       31      ?      ?    157      ? 
 82       30    172     50     79      0 
 83      105     44      ?    151      0 
 84      228     18     68     17      0 
 85        9    289     10     23      0 
 86       37    272      9     13      0 
 87       37    217     63     14      0 
 88       13    266     11     41      0 
 89       34     14    263     20      0 
 90        7     10    292     22      0 
 91       21     47     65    195      3 
 92      161     67     61     40      2 
 93       42     14     56    216      3 
 94       47    102    167     14      1 
 95       12     59    206     52      2 
 96       67    236      8     19      ? 
W2 scoring procedure to the Form-D answer sheets of 328 examinees in 
Sample 2R-C. Similarly, the W4 weights for Form D were obtained by 
treating each of the 96 items in Form D as a variable for predicting 
total scores on Form C obtained by applying the W2 scoring procedure 
to the Form-C answer sheets of 331 examinees in Sample 2R-D. 

Prior to calculating the partial regression coefficient for each 
item in Tests C and D for use in predicting total scores, the latter 
were converted into normalized standard scores with a mean of 50.000 
and a standard deviation of 21.066. The partial regression coefficients 
that were obtained for scores on the 96 items in Test C are presented in 
Table 7. Analogous data for scores on the 96 items in Test D are shown 
in Table 8. 

These regression coefficients, based on data obtained in Samples 
2R-C and 2R-D, were used to modify the response-category scoring weights 
established for items in Tests C and D on the basis of data obtained in 
Samples 1R-C and 1R-D. For example, the response-category W4 scoring 
weights for item 1 of Test C were obtained by multiplying each of the 
five W3 response-category weights (as shown in Table 3) by the partial 
regression coefficient for this item (shown in Table 7). The W4 scoring 
weights for the remaining 95 items in Test C and for the 96 items in 
Test D were obtained in an analogous manner. The resulting W4 weights 
are called "adjusted response-category weights." Tables 9 and 10 show 
the multiple correlations and associated statistical data. 
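
The two steps just described (estimating one partial regression 
coefficient per item, then multiplying that item's W3 weights by it) can 
be sketched as follows. This is an illustration only: it assumes ordinary 
least squares on mean-centered scores solved through the normal 
equations, and all function names are hypothetical rather than taken 
from the report. 

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def partial_regression_coefficients(item_scores, criterion):
    """item_scores: one row per examinee, one W3 item score per column."""
    n, k = len(criterion), len(item_scores[0])
    col_means = [sum(row[j] for row in item_scores) / n for j in range(k)]
    y_mean = sum(criterion) / n
    xc = [[row[j] - col_means[j] for j in range(k)] for row in item_scores]
    yc = [y - y_mean for y in criterion]
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def adjust_weights(w3_weights, coefficient):
    """Multiply every W3 category weight of one item by its coefficient."""
    return {category: w * coefficient for category, w in w3_weights.items()}
```

With 96 items and only a few hundred examinees, coefficients fitted in 
this way can capitalize heavily on chance, which is why cross-validation 
in an independent sample matters for evaluating the adjusted weights. 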



Estimation of Parallel-Forms Reliability Coefficients for 
Total Scores on Tests C and D Obtained by 
Four Different Scoring Methods 

To estimate parallel-forms reliability coefficients for total 
scores on Tests C and D that were obtained by four different scoring 
methods, the Form-C and Form-D answer sheets for examinees in Sample 
3R (who took both forms) were used. Thus, eight total-test scores were 
obtained for each of the 360 examinees in Sample 3R. It should be 
noted that the correlation coefficients among these eight scores were 
based on data that had had no influence in determining the scoring 
weights used in methods W1, W2, W3, or W4. As a result of this 
cross-validation procedure, the coefficients are entirely free from 
spurious inflation caused by capitalization on chance effects. Mosier 
(1951) discussed the effects of cross-validation, so they need not be 
presented here in detail. 

Table 11 shows the four parallel-forms reliability coefficients 
for Tests C and D as underlined entries, along with certain other 
intercorrelations of the eight scores obtained in Sample 3R. The 
underlined entries may properly be treated as reliability coefficients 
of either Form C or Form D because they are correlation coefficients 
between sets of test scores constructed to measure the same mental 
functions and 


Table 7 

Partial Regression Coefficient for Each Variable* in Form C When the 
Criterion Is Normalized Standard Scores on Form D 
Sample 2R-C 
(N = 331) 

*Items were considered "variables" in the multiple regression equation and are so 

Multiple Correlation and Significance-Test Summary for the 

Table 11 

Intercorrelations, Means, and Standard Deviations of Several 
Total Scores on Tests C and D Obtained by Four Scoring Methods 
(N = 360) 
(Parallel-Forms Reliability Coefficients are Underlined) 
expressed in normalized standard scores having similar means, standard 
deviations, and distributions. 



It will be noted that the parallel-forms reliability coefficients 
increased from .882 for scoring method W1 to .883 for method W2 to .894 
for method W3. It had been expected, on a priori grounds, that methods 
W1 and W2 would yield insignificantly different reliability coefficients, 
since Tests C and D were administered under essentially untimed 
conditions with directions that read: "Mark items even if you are not 
sure of the answers, but avoid wild guessing." There was no reason to 
expect that variations in gambling tendencies among the examinees would 
markedly affect their scores. 

It had, however, been expected that scoring method W3 would yield 
a significantly higher reliability coefficient than either of methods 
W1 or W2. Data from studies by Davis and Fifer (1959), Hendrickson (1971), 
and Reilly and Jackson (1972) supported this expectation, which was 
realized. 

Finally, on a priori grounds, it seemed reasonable to suppose that 
the "purification" of total scores likely to be brought about by scoring 
with method W4 would lead to a higher parallel-forms reliability 
coefficient for scores obtained by method W4 than for scores obtained 
by method W3. This expectation was not confirmed by the data. 



Tests of Significance of Planned Comparisons 

Four planned comparisons were made to test specific hypotheses 
of interest. The first of these was a null hypothesis that may be 
written as 

    H_0: \rho_{C_{W1} D_{W1}} = \rho_{C_{W2} D_{W2}} 

This hypothesis may be tested by converting both r_{C_{W1} D_{W1}} and 
r_{C_{W2} D_{W2}} into their corresponding values of Fisher's z and 
forming a t ratio as follows: 

    t = \frac{z_{C_{W1} D_{W1}} - z_{C_{W2} D_{W2}}}{\sqrt{(2 - 2 r_{z_1 z_2}) / (n - 3)}} 

The value of the correlation coefficient between the z's, 
r_{z_1 z_2}, can be estimated in large samples by means of an equation 
given by McNemar (1949, p. 125, equation 48). 

For the difference of .001 between the parallel-forms reliability 
coefficients of total scores on Tests C and D obtained by methods W1 and 
W2, the t ratio was .6364 with 357 degrees of freedom. Thus, the null 
hypothesis is accepted. 
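
A minimal sketch of this t ratio is given below. The correlation between 
the two z values is taken as an input because McNemar's estimating 
equation is not reproduced in this section; the function names and the 
value passed for `r_zz` in any call are assumptions for illustration. 

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)

def t_ratio_for_dependent_correlations(r_a, r_b, r_zz, n):
    """t ratio for the difference between two correlated correlations.

    r_a, r_b: the two reliability coefficients being compared
    r_zz:     correlation between their z values (supplied by the caller)
    n:        number of examinees; degrees of freedom are n - 3
    """
    standard_error = math.sqrt((2.0 - 2.0 * r_zz) / (n - 3))
    return (fisher_z(r_a) - fisher_z(r_b)) / standard_error
```

With n = 360 the ratio has 357 degrees of freedom, as in the text. The 
larger the correlation between the two z values, the smaller the 
standard error, so even small differences between coefficients can 
reach significance. 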

The second planned comparison was that between the reliability 
coefficients of Tests C and D scored by methods W2 and W3. The 
statistical hypotheses tested were: 

    H_0: \rho_{C_{W2} D_{W2}} = \rho_{C_{W3} D_{W3}} 

    H_1: \rho_{C_{W2} D_{W2}} < \rho_{C_{W3} D_{W3}} 

If a t ratio is formed by the same procedures used in testing 
the statistical significance of the first planned comparison, the value 
obtained is -2.3965 with 357 degrees of freedom. This result leads to 
rejection of the null hypothesis and acceptance (at the .01 level of 
significance) of the directional alternative, H_1. We conclude that 
response-category scoring yields a greater parallel-forms reliability 
coefficient for Tests C and D than does scoring with the conventional 
correction for chance success. 

The third planned comparison was between the reliability 
coefficients of total scores from Tests C and D obtained by methods W2 
and W4. The null hypothesis may be stated as 

    H_0: \rho_{C_{W2} D_{W2}} = \rho_{C_{W4} D_{W4}} 

If a t ratio is formed by procedures analogous to those described 
above, the value obtained is 6.6720. Consequently, the null hypothesis 
(H_0) is rejected. This leaves us in the position of concluding that 
total scores on Tests C and D are less reliable (at the .01 level of 
significance) when they are obtained by method W4 than when they are 
obtained by method W2. This result was not expected. 

The fourth planned comparison was made between reliability 
coefficients for Tests C and D when scores were obtained by methods 
W3 and W4. The null hypothesis may be stated as 

    H_0: \rho_{C_{W3} D_{W3}} = \rho_{C_{W4} D_{W4}} 

The t ratio for testing this hypothesis is 8.3795. Again, the 
null hypothesis must be rejected. The data, as in the case of the third 
planned comparison, ran counter to our expectations, since they indicate 
that the reliability coefficients of Tests C and D are lower when the 
scores are obtained by method W4 than when they are obtained by method 
W3. 



CHAPTER IV 
THE PREDICTIVE VALIDITY STUDY 



Purpose 

The purpose of the predictive validity study in this investi- 
gation of the effects of differential choice weighting on test 
reliability and validity was to compare the predictive validity 
coefficients of the Davis Reading Test (Series 1, Form D) when scores 
were obtained by four different methods. 



Test Used 

The Davis Reading Test, Series 1, Form D (Davis & Davis, 1962) 
is designed to measure five categories of reading skills and is intended 
for use in grades 11 and 12 and with entering college freshmen. The 
test is made up of 80 items and is administered in a 40-minute time 
limit. Two successive equivalent scales of 40 items each are 
incorporated into the test. Since virtually all examinees try the first 
scale (40 items) in 40 minutes and very few examinees have time to 
try 80 items in 40 minutes, two scores can be derived from the test. 
The first is a Level-of-Comprehension score based on the first 40 items 
and the second is a Speed-of-Comprehension score based on the entire 
80 items in the test. In this study only the Speed-of-Comprehension 
score was obtained for each examinee, although the scoring weights 
assigned to the five choices in each item, to omissions, and to failure 
to reach an item in the time limit could have been used to obtain 
Level-of-Comprehension scores based on the first 40 items only. 



Samples 

As part of the regular placement testing program, Form D of 
Series 1 of the Davis Reading Test (Davis & Davis, 1962) was 
administered to freshmen upon entrance into the University of 
Pennsylvania. Answer sheets from this test were available for 3,840 
students tested during the period 1968-1970. 

Complete data, including grade-point averages at the end of 
their freshman year, could be located for 2,869 of the initial sample. 
This group, which included students from several undergraduate divisions 
of the University, was divided at random into three samples of 953 cases 
each. Random selection within undergraduate division was not done. 
Thus, three groups, labeled 1V, 2V, and 3V, constituted the three samples 
needed to conduct all steps in the investigation of the effects of 
weighting on predictive validity. Table 12 provides descriptive 
statistics pertaining to the three samples used in the predictive 
validity study. 



Scores To Be Compared 

The four methods for obtaiiJng scores to be used in obtaining 
predictive validity coefficients for the Davis Reading Testb (Series 1, 
Form D) in Sample 3V are as follows: 

Wl: For each item, examinees were credited with 1 point for a 
correct response, 0 for an incorrect response, 0 for omission (failure 
to Ljrk any choice as correct after reading the item), and 0 for not 
marking any choice as correct because of lack of sufficient time to 
consider the item. The total test score consisted of the sum of the 
item scores in it. This is commonly called "number-right-scoring.*' 

W2: For each item, examinees were credited with 1 point for a
correct response, -1/(k-1) for an incorrect response (where k represents
the number of choices per item), 0 for omission, and 0 for not
marking any choice as correct because of lack of sufficient time to
consider the item. This is commonly called "formula scoring" and
embodies a correction for chance success.

W3: For each item, examinees were credited with scores based on 
weights assigned to each choice and to the response categories of 
omission (failure to mark any choice as correct after reading the item) 
and "not read" (failure to mark any choice as correct because of lack 
of sufficient time to consider the item). Each scoring weight was made 
proportional to the mean criterion score for examinees who fell in a 
given response category. The criterion scores for establishing scoring 
weights for the Davis Reading Test were grade-point averages for examinees 
in Sample 1V. The total scores obtained by method W3 consisted of the
algebraic sum of the scoring weights for the 80 response-categories
(one per item) selected by each examinee on the Davis Reading Test.

W4: For each item, the examinees were credited with scores based
on modified scoring weights assigned to each choice and to the response 
categories of omission and "not read." Each of the scoring weights 
obtained by method W3 was "modified" by multiplying it by the partial 
regression coefficient that would maximize the multiple correlation 
between a set of linear composites of the 80 item scores in the Davis 
Reading Test and a set of criterion scores. The criterion scores were 
grade-point averages for examinees in Sample 2V. 
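
The two conventional scoring rules defined above, W1 (number-right) and W2 (formula scoring), can be sketched in a few lines. The response codes "omit" and "not_read", the function names, and the five-item example are assumptions made for illustration only; k = 5 reflects the five choices per item in the Davis Reading Test.

```python
def score_item_w1(response, key):
    """W1: 1 point for the keyed choice, 0 for anything else."""
    return 1.0 if response == key else 0.0


def score_item_w2(response, key, k=5):
    """W2: 1 for correct, -1/(k-1) for incorrect, 0 for omit or not read."""
    if response in ("omit", "not_read"):
        return 0.0
    return 1.0 if response == key else -1.0 / (k - 1)


def total_score(responses, keys, item_scorer):
    """Sum the item scores produced by the chosen scoring rule."""
    return sum(item_scorer(r, key) for r, key in zip(responses, keys))


# Made-up five-item example: two correct, one wrong, one omit, one not read.
example_responses = ["A", "C", "omit", "B", "not_read"]
example_keys = ["A", "B", "C", "B", "E"]
w1_total = total_score(example_responses, example_keys, score_item_w1)
w2_total = total_score(example_responses, example_keys, score_item_w2)
```

In this example the W1 total is 2 and the W2 total is 1.75, since the single wrong answer costs one quarter of a point under formula scoring.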

It should be noted that method W3 and method W4 differ from those
described in the reliability study (p. 22). In the case of the predictive
validity study the category of "not read" is considered as a valid item
response category. Thus, in the present study, five choices plus omits
and "not read" comprise an array of seven item-response categories for
each item in the Davis Reading Test.



Table 12

Descriptive Statistics
For Grade-Point Averages
For Three Samples of University Freshmen

Note: These are normalized standard scores with Mean = 50.000
and S.D. = 21.066. An approximate "table-look-up" procedure
resulted in minor variations from those values.



Determination of Scoring Weights for Method W3 

Answer sheets for the 953 examinees in Sample 1V were used to
obtain W3 scoring weights for the seven possible response-categories
for each item of the Davis Reading Test. Criterion scores used to
obtain the weights were first-semester grade-point averages after their
transformation to normalized standard scores with a mean of 50.000 and
a standard deviation of 21.066.

The mean criterion score of those examinees who fell in each 
response category for each of the 80 items in the Davis Reading Test
was calculated. These means were then transformed linearly so that,
within each item, the sum of the products of each transformed mean and
the number of examinees entering into its calculation was made equal to
zero. The transformed mean criterion score for each item-response 
category was used as the weight in method W3. 
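
The W3 weighting step for a single item can be sketched as below. The sketch assumes the simplest linear transformation that satisfies the stated constraint, namely subtracting the item's overall mean criterion score from each category mean; the report does not give the scale factor actually used, and the function name and the four-examinee example are hypothetical.

```python
from collections import defaultdict


def w3_weights_for_item(responses, criterion):
    """Return ({category: weight}, {category: n}) for one item.

    Each weight is the mean criterion score of the examinees in that
    category minus the overall mean, so that sum(n * weight) == 0.
    The unit scale factor is an assumption.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, score in zip(responses, criterion):
        totals[category] += score
        counts[category] += 1
    grand_mean = sum(criterion) / len(criterion)
    weights = {c: totals[c] / counts[c] - grand_mean for c in counts}
    return weights, dict(counts)


# Made-up data: responses of four examinees to one item and their
# criterion scores on the normalized scale.
weights, counts = w3_weights_for_item(
    ["A", "A", "B", "omit"], [60.0, 40.0, 70.0, 30.0]
)
```

With these made-up values the weights are 0 for choice A, 20 for choice B, and -20 for omission, and the frequency-weighted sum of the weights is zero as required.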

The W3 response-category weights for the Davis Reading Test are 
shown in Table 13. The numbers of examinees on which the weights are 
based are shown in Table 14. 



Determination of Scoring Weights for Method W4 

For each of the 953 examinees in Sample 2V each item of the 
Davis Reading Test was scored by W3 weights established in Sample 1V.
With each of the 80 items considered as an independent variable in a 
linear composite for predicting the grade-point averages for the examinees 
in Sample 2V, a partial regression coefficient was obtained for scores 
in each predictor variable. Coefficients obtained in this manner tend 
to maximize the relationship between the criterion variable and the compos- 
ite of variables for which the coefficients were obtained. 
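
A minimal sketch of obtaining raw-score partial regression coefficients by ordinary least squares follows. It solves the normal equations directly in plain Python; the report does not say which program or algorithm was used, so the function name and the solution method are assumptions, and the tiny two-predictor data set is invented to keep the example checkable.

```python
def partial_regression_coefficients(item_scores, criterion):
    """Least-squares fit of criterion on item scores.

    item_scores: one row of item scores per examinee.
    Returns (intercept, [b_1, ..., b_p]).
    """
    rows = [[1.0] + [float(v) for v in row] for row in item_scores]
    size = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(rows, criterion))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    solution = [aug[i][size] / aug[i][i] for i in range(size)]
    return solution[0], solution[1:]


# Invented data generated exactly from y = 2 + 3*x1 - 1*x2.
x = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [2, 5, 1, 4, 7]
intercept, coefficients = partial_regression_coefficients(x, y)
```

In the study each row would hold the 80 W3 item scores of one Sample 2V examinee and the criterion would be that examinee's normalized grade-point average.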

Prior to calculating the partial regression coefficients for each 
item in the Davis Reading Test for use in predicting grade-point averages, 
the latter were transformed into normalized standard scores with a mean 
of 50.000 and a standard deviation of 21.066. The partial regression 
coefficients that were obtained for scores on the 80 items in the Davis 
Reading Test are shown in Table 15. 
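
The transformation of grade-point averages to normalized standard scores can be sketched as follows. The report says only that an approximate table look-up was used, so the rank-based procedure here (mid-ranks converted to percentile positions, then to normal deviates) is an assumption standing in for that look-up, and the function name is hypothetical.

```python
from statistics import NormalDist


def normalized_standard_scores(values, mean=50.0, sd=21.066):
    """Rank-based normalization to a scale with the given mean and S.D.

    Tied values share the average of their ranks; rank r out of n is
    mapped to the percentile position (r - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = mid_rank
        i = j + 1
    std_normal = NormalDist()
    return [mean + sd * std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Made-up grade-point averages for three examinees.
scores = normalized_standard_scores([2.0, 3.5, 3.0])
```

The middle-ranked examinee receives a score of 50, and the other two fall symmetrically below and above it.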

Table 14

Frequency of Response to Each Response Category
in the Davis Reading Test, Series 1, Form D
Sample 1V
(N = 953)

[Table body: number of examinees in each response category (columns
Item, A, B, C, D, E, Omit, NR) for items 1-80; entries not legible]



Table 15

Partial Regression Coefficients for Scores of 80 Items
in the Davis Reading Test, Series 1, Form D,
for Predicting Freshman Grade-Point Averages

Sample 2V
(N = 953)

[Table body: columns VARIABLE, B, BETA, STD ERROR B, and F for the
80 item variables; entries not legible]



The partial regression coefficients established on the basis of
data from Sample 2V were used to modify the response-category weights
that had been established for items in the Davis Reading Test using data
from Sample 1V. For example, the W4 scoring weights for item 1 were obtained
by multiplying each of the seven W3 response-category weights (as shown
in Table 13) by the raw-score partial regression coefficient for item 1
(shown in Table 15). The W4 weights for the remaining items were obtained
in the same way. The resulting W4 weights are termed "adjusted response- 
category weights." Table 16 shows the multiple correlation and associated 
statistical data. 
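
The adjustment that turns W3 weights into W4 weights, as described in this section, amounts to one multiplication per response category. The sketch below uses invented weights and an invented coefficient; none of the numbers are taken from Tables 13 or 15, and the function names are hypothetical.

```python
def adjust_weights(w3_weights, b):
    """Multiply every W3 category weight of one item by that item's
    raw-score partial regression coefficient b."""
    return {category: b * w for category, w in w3_weights.items()}


def w4_total_score(responses, w3_by_item, b_by_item):
    """Sum the adjusted weights of the categories an examinee selected."""
    return sum(
        b * w3[response]
        for response, w3, b in zip(responses, w3_by_item, b_by_item)
    )


# Invented W3 weights for two items and invented coefficients.
w3_by_item = [
    {"A": 4.0, "B": -2.0, "omit": -8.0},
    {"A": -6.0, "B": 2.0, "omit": 0.0},
]
b_by_item = [0.5, -0.25]
w4_item_1 = adjust_weights(w3_by_item[0], b_by_item[0])
w4_total = w4_total_score(["A", "B"], w3_by_item, b_by_item)
```

A negative coefficient reverses the sign of every category weight in its item, which is one way W4 scores can come to differ sharply from W3 scores.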



Estimation of Predictive Validity Coefficients for 
the Davis Reading Test Total Scores 
Obtained by Four Different Scoring Methods

To estimate predictive validity coefficients for total score on 
the Davis Reading Test obtained by four different scoring methods, the 
answer sheets for examinees in Sample 3V were used. Four total-test 
scores were obtained for each of the 953 examinees in Sample 3V. Sample 
3V in no way influenced the determination of either response-category 
weights or partial regression coefficients. Thus, the correlation 
coefficients are free from spurious inflation caused by capitalization 
on chance effects. 

The predictive validity coefficients, as shown in the first row 
of Table 17, were obtained by correlating total-test scores of the
Davis Reading Test by four different scoring methods with grade-point 
averages for the examinees in Sample 3V. The grade-point averages had 
been transformed into normalized standard scores with a mean of 50.000 
and a standard deviation of 21.066. 
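
A predictive validity coefficient of the kind described above is a correlation between test scores and criterion scores. The sketch below assumes the ordinary product-moment (Pearson) correlation, which the report does not name explicitly; the function name and the four data points are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented test scores and criterion scores for four examinees.
test_scores = [1.0, 2.0, 3.0, 4.0]
criterion_scores = [2.0, 4.0, 6.0, 8.0]
validity = pearson_r(test_scores, criterion_scores)
```

In the study the same calculation would be repeated four times in Sample 3V, once for each set of total scores (W1 through W4).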

The predictive validity coefficients appear to be quite similar
for methods W1, W2, and W3. It had been expected, however, on a-priori
grounds, that method W2 would yield a higher validity coefficient than
W1.

It had also been expected that method W3 would result in a predictive
validity coefficient higher than W2. The intent of previous
studies (Davis & Fifer, 1959; Hendrickson, 1971; Reilly & Jackson, 1972)
was to improve reliability through techniques of response-category
weighting similar to those employed here. None of these studies sought
to improve predictive validity directly, however. It would seem, 
though, that the same line of reasoning would apply. The weighting 
procedure, as defined here, tends to maximize the relationship between 
item scores and a criterion. If the criterion of interest is grade- 
point averages, weighting a test to predict that criterion should 
tend to maximize predictive validity. 

It seemed reasonable to hypothesize that, if the W3 scoring
method tended to maximize the relationship between the criterion and a
test weighted by that method, the coefficient for W3 would be greater
than for, say, W1 or W2. This expectation was not realized.

Finally, it seemed reasonable to suppose that W4 scoring would
lead to a higher predictive validity coefficient than W2 scoring or W3
scoring. These expectations were realized.



Table 16

Multiple R for Regression of GPA on the
80-item Davis Reading Test, Series 1, Form D
Sample 2V

Multiple R      0.42027
R squared       0.17663
Standard error of raw-score regression plane    [value not legible]

Source        d.f.    Mean Square    F
Regression                           2.3382
Residual      872

Note: Scores on all variables are expressed as standiles
(Mean = 50.000, S.D. = 21.066). Due to a "table-look-up" procedure,
minor variations occurred in this transformation.



Tests of Significance of Planned Comparisons



Four planned comparisons were made. Each comparison tested a
specific hypothesis of interest. The first hypothesis was one of no
difference between predictive validity coefficients when the Davis
Reading Test was scored by methods W1 and W2. That is,
$H_0\colon \rho_{(GPA)(W1)} = \rho_{(GPA)(W2)}$. The statistical
significance of the difference between two such coefficients can be
obtained by applying the equation

$$t = (r_{12} - r_{13}) \sqrt{\frac{(N-3)(1+r_{23})}{2\,(1 - r_{12}^{2} - r_{13}^{2} - r_{23}^{2} + 2 r_{12} r_{13} r_{23})}}$$

In this application $r_{12}$ and $r_{13}$ are the two validity
coefficients being compared, $r_{23}$ is the correlation between the
two sets of test scores, and $N$ is the number of examinees.
This equation (McNemar, 1949) takes into consideration the fact that
the correlations being compared were obtained in the same sample and are
themselves correlated or dependent.

For the first planned comparison a t value of less than unity
was obtained, which indicates that the difference is not statistically
significant. The hypothesis of no difference between the two correlation
coefficients could not be rejected.

The second planned comparison tested a hypothesis of no difference
between predictive validity coefficients obtained when the Davis
Reading Test was scored by method W2 and by method W3. The hypothesis,
stated in null form, was: $H_0\colon \rho_{(GPA)(W2)} = \rho_{(GPA)(W3)}$.
The appropriate equation was applied. The t value was found to be less
than unity. Again, the hypothesis of no difference could not be rejected.

The third planned comparison tested the hypothesis of no difference
between $\rho_{(GPA)(W2)}$ and $\rho_{(GPA)(W4)}$. This comparison
resulted in a t value of 3.397. The probability that a difference in
correlation coefficients as great as that obtained would occur by chance
is less than .05. We conclude that the predictive validity of the Davis
Reading Test was significantly improved by using W4 scoring instead of
the conventional W2 method.

The fourth planned comparison was made between predictive validity
coefficients when the Davis Reading Test was scored by methods W3 and
W4. The null hypothesis is stated as:

$$H_0\colon \rho_{(GPA)(W3)} = \rho_{(GPA)(W4)}$$

The obtained t ratio of 3.858 is statistically significant (d.f. =
950; p < .05). The null hypothesis must be rejected, and we conclude
that an improvement in the predictive validity of the Davis Reading Test
for freshman first-semester grade-point averages can be obtained by the
use of the modified response-category weights yielded by method W4.
While W3 scoring does not lead to increases in predictive validity, the
W4 method does. W4 scoring involves both Guttman response-category
weighting and item weighting (based upon multiple linear regression
procedures) and alters the characteristics of the scores in such a way as
to maximize their predictive validity for a designated criterion variable.



CHAPTER V



SUMMARY, DISCUSSION, AND CONCLUSIONS



Summary of the Reliability Study 

Investigations of the effects of various weighting methods on test
reliability and predictive validity are reported in the literature
periodically. Several recent studies (Davis & Fifer, 1959; Hendrickson,
1971; Reilly & Jackson, 1972) have reported mixed results using a
differential choice weighting procedure similar to the one used in this
investigation.

The purpose of the reliability study was to compare the parallel-
forms reliability coefficients of two forms (C and D) of an experimental
reading-skills test when scores were obtained by four different methods.
Two of the methods were: 1) "number-right scoring" where, for each item,
examinees received 1 point for a correct response and 0 for any other
response; and 2) "formula-scoring" that involved a correction for chance
success. Under formula scoring, for each item, examinees received 1 point
for a correct response, 0 for omission, and -1/(k-1) points for an
incorrect response. These scoring methods do, in a sense, weight the
response alternatives differentially. Both are commonly employed in the
scoring of aptitude tests and require no explanation of the background
upon which they are based. In the reliability study these test-scoring
methods were labeled W1 and W2, respectively.

Two other methods of test scoring under study in this investigation
involved the differential weighting of item choices. The two
methods were: 1) response-category-weight scoring which involved cross-
validated weights for every item-response category including omission;
and 2) "adjusted" response-category-weight scoring which involved
cross-validated weights for every item-response category including
omission after the weights have been adjusted by means of cross-validated
partial regression coefficients for predicting a defined criterion.
These test-scoring methods were labeled W3 and W4, respectively.

The method of weighting item-response categories that was used in
this investigation was described some time ago by Guttman (1941).
Guttman showed that to maximize the relationship between a criterion
and the response categories for any given test item the weight for each
category should be linearly related to the mean criterion score of
persons who select that category. This weighting procedure was one
of the test-scoring methods used in this study (labeled W3) and was the
basis for another (labeled W4).

Although the procedure described by Guttman leads to a relationship
between a criterion and the response categories in a single test item
that is at a maximum, it does not necessarily lead to a relationship
that is maximized for that criterion and a total-test score. If
scores for a series of test items are to be summed to obtain a total
test score for a person, the relationship between the total-test scores
and criterion scores will tend to be at a maximum when each item is
weighted by the appropriate partial regression coefficient. The W4
test-scoring method consisted of the procedure described by Guttman
plus the multiple regression procedure. The combination of the two
weighting procedures leads to response-category weights for an item
that are "adjusted" by the partial regression coefficient for the item
containing the response categories.

Methods W3 and W4 required that, for each examinee, a total-
test score on both forms of the test be available. This was necessary
because the weight determined for each response category of an item in
a test form was proportional to the mean score on the corresponding
parallel form of all examinees selecting that response category. Thus,
the mean total-test score on Form D of those examinees who fell in each
response category for each item in Form C of the test was calculated.
Analogous scoring weights were obtained for the categories in Form D.

In another sample the test items in each form for each examinee
were scored using the obtained response-category weights. Each of the
96 items in one form, scored by response-category weights, was treated
as an independent variable for predicting total score on the corresponding
parallel form of the test. The regression of the parallel-form
test score upon the 96 items in the corresponding form produced a
partial regression coefficient for each item in Forms C and D. The
weight for each response category within a test item was multiplied by
the partial regression coefficient for the item. The products were
termed "adjusted response-category weights." This procedure provided
the weights required for the W4 method of scoring.

In another sample of examinees for whom data on both Forms C
and D were available, scores for each test form for each examinee were
obtained by methods W1, W2, W3, and W4. Certain intercorrelations of
the eight scores may be interpreted as parallel-forms reliability
coefficients for Forms C and D scored by each of the four methods.

Statistical comparisons revealed that for Forms C and D:

1) No difference was found between the parallel-forms reliability
coefficients of "number-right" scores (method W1) and scores corrected
for chance success (method W2);

2) Scoring with W3 weights for each response category resulted in a
significant increase in parallel-forms reliability over that of scoring
with a correction for chance success (method W2);

3) Scoring with W4 weights for each item choice yielded a reliability
coefficient for the resulting scores that was significantly lower than
the reliability coefficient for scores corrected for chance success
(method W2);

4) Scoring with W4 weights for each item choice yielded scores
significantly less reliable than scores yielded by method W3.



Summary of the Predictive Validity Study

The objective of the predictive validity study was to compare
the predictive validity coefficients of the Davis Reading Test
(Series 1, Form D) when scores were obtained by four different test-
scoring methods. The criterion scores in this study were first-
semester grade-point averages for university freshmen. The four
test-scoring methods compared were: 1) number-right scoring;
2) scoring using a correction for chance success (these two scoring
methods are identical to those used in the reliability study); 3)
scoring with weights for each response category plus omission and
"not read" (omitting an item because of lack of sufficient time to
consider the item); and 4) scoring with weights for every response
category for each item "adjusted" by the appropriate partial regression
coefficient. These methods were labeled W1, W2, W3, and W4,
respectively.

The predictive validity study differed from the reliability
study in two important ways. First, the response-category weighting
procedures differentiated between omission and "not read" (failure
to mark any choice as correct because of lack of sufficient time to
consider the item). Second, the criterion scores used to determine
response-category weights were not test scores, but were first-
semester grade-point averages for freshmen at the University of
Pennsylvania.

Three samples of examinees were required in the predictive 
validity study. Using method W3, response-category weights were 
established in one sample according to the Guttman response-category 
weighting procedure described elsewhere in this report. In a second 
sample drawn from the same parent population, grade-point averages
were regressed, in a multiple linear regression, on scores for each of 
the 80 items in the Davis Reading Test. Each of the test items had 
been scored using the response-category weights obtained in the first 
sample of examinees. The required "adjusted response-category weights" 
were obtained by multiplying each weight in an item by the partial 
regression coefficient for that item. This procedure had been labeled 
method W4. Since each step in the determination of the response- 
category weights and regression coefficients involved an independent 
sample of examinees, the obtained weights and coefficients were free 
from spuriousness caused by capitalization on errors within the samples 
in which the weights and coefficients were obtained. Scores on the 
Davis Reading Test in a third sample of examinees were obtained using 
the four test-scoring methods. Predictive validity coefficients for 
the test scored in each manner were obtained by correlating test scores 
with grade-point averages.

Planned statistical comparisons between selected pairs of validity
coefficients revealed that:

1) No significant difference was found in the predictive validity of
"number-right" scores (method W1) and scores corrected for chance success
(method W2);

2) No significant difference was found in the predictive validity of
scores obtained by applying W3 weights for each item-response category
and for scores corrected for chance success (method W2);

3) Scoring with "adjusted response-category" weights (method W4)
resulted in a significantly higher predictive validity coefficient than
scoring with a correction for chance success (method W2);

4) Scoring with "adjusted response-category" weights (method W4)
resulted in a significantly higher predictive validity coefficient than
scoring with W3 weights for each response category. 
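
The planned comparisons summarized above rely on a t test for the difference between two correlations obtained in the same sample (attributed in this report to McNemar, 1949). A minimal sketch follows, assuming the usual form of that test, in which variable 1 is the criterion and variables 2 and 3 are the two sets of test scores. The function name and the illustrative values are hypothetical and are not figures from this study, since the correlations between the differently scored totals are not reported here.

```python
from math import sqrt


def dependent_r_t(r12, r13, r23, n):
    """t ratio for H0: rho12 == rho13 when both correlations come from
    the same n examinees; returns (t, degrees of freedom = n - 3)."""
    determinant = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    t = (r12 - r13) * sqrt((n - 3) * (1 + r23) / (2 * determinant))
    return t, n - 3


# Illustrative values only: validity coefficients of .50 and .30 for two
# uncorrelated sets of scores in a sample of 103 examinees.
t_value, df = dependent_r_t(0.50, 0.30, 0.0, 103)
```

With 953 examinees the test has 950 degrees of freedom, which agrees with the value reported for the fourth comparison.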



Discussion and Conclusions of the Reliability Study 

As shown in Table 11, the parallel-forms reliability coefficients 
of scores obtained by scoring methods W1, W2, W3, and W4 were .882,
.883, .894, and .794, respectively.

The fact that methods W1 and W2 yielded scores that were virtually
identical with respect to their reliability coefficients had been 
expected because the tests had been administered under generous time 
limits that permitted every examinee to consider every item and because 
the directions included the sentence "Mark items even if you are not 
sure of the answers, but avoid wild guessing." 

Because the use of differential choice weights obtained by Guttman's 
procedure (Guttman, 1941) allows the variance generated by use of partial 
information and misinformation in the marking of answers to items to 
which an examinee is not sure of the correct answer to be included in 
test scores, it was expected that the reliability coefficient of W3 
scores would be higher than that of either W1 or W2 scores. This
expectation was realized.

On the other hand, the a-priori expectation that W4 scores would 
have a higher reliability coefficient than W3 scores was not realized. 
Instead, as noted above, the parallel-forms reliability coefficient of
W4 scores was significantly lower than that of the W3 scores by about .1. 
An adequate explanation of this phenomenon has simply not as yet been 
formulated. 

In general, it may be concluded that differential choice weights 
for item-response categories are useful for improving reliability per 
unit of test length. This confirms most previous studies pertaining to 
this point (Davis & Fifer, 1959; Hendrickson, 1971; Reilly & Jackson, 
1972).



Discussion and Conclusions of the Predictive Validity Study

Table 17 shows that the validity coefficients of scores obtained
by methods W1, W2, W3, and W4 are .297, .302, .298, and .407,
respectively.

The a-priori expectation that W2 scores would be more valid
than W1 scores was not realized. Changes in predictive validity
produced by scoring with a correction for chance success are usually
small and, unless very large samples are available, it is difficult
to demonstrate statistically significant differences. Although the
difference in the validity coefficients as a result of W2 versus W1
scoring was positive (.005), it was not statistically significant.
The lack of a significant difference is due, in part, to
the high correlation between the two types of scores. Large differences
must occur for the difference to be statistically significant.

The directions for the Davis Reading Test include a statement
against guessing wildly from among the choices if the correct answer
is not known. Because of this, the behavior of some examinees tends
to be more cautious, thus eliminating some variance in the scores due
to guessing. This effect would apply to "number-right" scores (W1)
as well as the "formula score" (W2).
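
The two scoring rules contrasted above can be sketched in a few lines of Python. The sketch assumes the conventional correction for chance success, rights minus wrongs divided by one less than the number of options; this section of the report does not state the formula, so that form, the function names, and the use of `None` for omitted or "not read" items are assumptions made for illustration.

```python
def number_right(responses, key):
    """W1-style score: the count of items answered correctly."""
    return sum(1 for r, k in zip(responses, key) if r == k)

def formula_score(responses, key, n_options):
    """W2-style score under the conventional correction: R - W / (k - 1).

    Items marked None (omitted or not read) count as neither right nor wrong.
    """
    rights = number_right(responses, key)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return rights - wrongs / (n_options - 1)

# Hypothetical five-item test with five options per item.
key = ["A", "B", "C", "D", "E"]
responses = ["A", "B", "D", None, "E"]   # three right, one wrong, one omitted
w1 = number_right(responses, key)                 # 3
w2 = formula_score(responses, key, n_options=5)   # 3 - 1/4 = 2.75
```

For an examinee who omits nothing, the formula score is a linear function of the number-right score, which is one reason the two kinds of scores tend to correlate highly.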

The expectation that W3 scores would be more valid than W2
scores was not realized. One reason for the lack of improvement in
predictive validity as a result of W3 scoring might well be
the importance that omitted and "not read" items assume in the
weighting scheme. The Speed score in the Davis Reading Test indicates
basically the rapidity and accuracy with which the examinee understands
the kinds of material ordinarily required at the college level.
Perhaps the W3 method alters the characteristics of the test in such
a way as to increase the importance of the speed factor. W3 scoring
might "refine" the measurement of speed of comprehension to a much
greater extent than either W2 or W1 scoring does.
Hendrickson (1971) has suggested that the factor structure of a test
might be altered as a result of Guttman response-category weighting.
Further, speed of comprehension as a reading skill may account for
less variance in the criterion (grade-point average) than other factors
measured in the weighted test.
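
The role that omitted and "not read" categories play in the weighting scheme can be seen in a minimal Python sketch of response-category weighting as the abstract defines it: each category of each item, including omit and "not read", receives a weight proportional to the mean criterion score of the examinees who selected it. The sketch uses the mean itself as the weight (a proportionality constant of 1); the function names, category labels, and data are hypothetical and are not taken from the report.

```python
from collections import defaultdict

def category_weights(response_matrix, criterion):
    """Mean criterion score for every (item index, category) pair observed."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for responses, y in zip(response_matrix, criterion):
        for item, category in enumerate(responses):
            sums[(item, category)] += y
            counts[(item, category)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

def weighted_score(responses, weights):
    """Sum of the weights of the categories an examinee selected."""
    return sum(weights[(item, category)]
               for item, category in enumerate(responses))

# Hypothetical two-item test taken by four examinees.
response_matrix = [
    ["A", "B"],
    ["A", "omit"],
    ["C", "B"],
    ["C", "not_read"],
]
gpa = [3.0, 2.0, 1.0, 2.0]   # hypothetical criterion scores

weights = category_weights(response_matrix, gpa)
first_examinee = weighted_score(response_matrix[0], weights)   # 2.5 + 2.0
```

Because "omit" and "not_read" receive empirical weights like any other category, an examinee's rate of work enters the score directly. In practice the weights would be derived in one sample and applied in another; a category never observed in the derivation sample has no weight here and raises a `KeyError`.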

The decrease in predictive validity that seemed to result from 
Guttman response-category weighting using grade-point average as the 
criterion was compensated for by W4 scoring. When the
response-category weights are "adjusted" by the appropriate partial
regression coefficients, the effects balance each other out. W4 scoring weights
more heavily those items that account for the greatest amount of 
variance in the criterion. Improvement in test validity due to W4 
scoring was expected. 
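
The item-weighting step can be sketched as an ordinary least-squares regression of the criterion on the item scores, with the resulting partial regression coefficients used to form the composite. The Python sketch below is a hedged illustration and not the procedure actually run in the study: it solves the normal equations directly, includes an intercept (which does not affect the correlation of the composite with the criterion), and uses hypothetical function names and data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            raise ValueError("singular system: item scores are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def regression_coefficients(item_scores, criterion):
    """Return [intercept, b_1, ..., b_p] from the normal equations."""
    rows = [[1.0] + list(r) for r in item_scores]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, criterion)) for i in range(p)]
    return solve(xtx, xty)

def composite_score(item_scores_row, coefs):
    """W4-style composite: each item score times its regression coefficient."""
    return coefs[0] + sum(b * s for b, s in zip(coefs[1:], item_scores_row))

# Hypothetical item scores (e.g., response-category weights) for two items.
item_scores = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
criterion = [3, 4, 6, 1, 8]          # constructed as 1 + 2*x1 + 3*x2
coefs = regression_coefficients(item_scores, criterion)
```

With many items and a modest sample, coefficients estimated this way capitalize on chance, which is why the composite should be evaluated in a cross-validation sample.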

The use of response-category weighting rests upon the important
consideration that the item options be sufficiently well-written
and refined to accurately measure the various degrees of partial 
information held by examinees. Davis (1959) has emphasized the
point that improvement in reliability and, presumably, validity
is attributable to the selection among incorrect options by examinees
who are unable to select the keyed option.

With regard to response-category weighting and item weighting,
several points must be considered. First, weights should be
established using large samples of examinees to ensure stability
of the weights upon cross-validation. Second, consideration should
be given to the magnitude of the weights assigned to incorrect and
omit categories. The dilemma posed when the weight for the keyed
category is less than the weight for an incorrect category should
be resolved. This point is especially important in light of the
comments by Green (1972) about the ethical problems posed by directions
about omitting items when in doubt. Frederick B. Davis
(personal communication) has suggested that the test directions
should convey to examinees the nature of the test scoring procedure.
Davis has also suggested that the correct, omit, and
perhaps "not read" categories each receive standard weights and that
incorrect categories receive differential weights. A refinement
of the empirical weights through a procedure similar to this might
overcome the ethical problems cited by Green (1972).

Although the results of the reliability and predictive validity
studies are mixed, the evidence points to the value of response-
category weighting for improving test reliability. The value of
response-category weighting for improving predictive validity is
less apparent. The application of response-category weighting with
item weighting holds promise as a means of improving predictive
validity. Further research in this area is, however, required before
a definitive statement about the overall value of response-category
weighting can be made.